                         Neural Vicinal Risk Minimization:
                         Noise-robust Distillation for Noisy Labels
Hyounguk Shon¹, Seunghee Koh¹, Yunho Jeon² and Junmo Kim¹,*
¹ Korea Advanced Institute of Science and Technology (KAIST), 291 Daehak-ro, Yuseong-gu, Daejeon, 34141, South Korea
² Hanbat National University, 125, Dongseo-daero, Yuseong-gu, Daejeon, 34158, South Korea


                                             Abstract
Training deep neural networks under noisy supervision remains a challenging problem in weakly supervised learning. Mislabeled instances can severely degrade the generalization ability of classification models on unseen data. In this paper, we propose a novel regularization method coined Noise-robust Distillation (NRD) that addresses robust training under noisy supervision. NRD is motivated by a novel learning framework, which we name Neural Vicinal Risk (NVR) minimization, that improves the estimation quality of the data distribution and handles label noise effectively. Our framework is based on the observation that a neural network retains the capability to correctly classify data sampled from the vicinal distribution even when the model is overfitted to noisy labels. By ensembling predictions over the neural vicinal distribution, we obtain an accurate estimate of the class probabilities that reflects sample-wise class ambiguity. We validate our method on various noisy-label benchmarks and demonstrate significant improvements in robustness to label noise.

                                             Keywords
Learning with Label Noise, Vicinal Risk Minimization, Noise-robust Loss


1. Introduction

Deep learning models have achieved remarkable success in various domains, including image classification, natural language processing, and speech recognition. However, the performance of these models heavily relies on the availability of high-quality labeled data for training. Obtaining accurately annotated labels can be a challenging and time-consuming task, often requiring human annotators to manually label large amounts of data. As a result, noisy labels may arise during the annotation process, leading to suboptimal model performance.

In this paper, we address noisy label learning as a subset of a more generic type of problem. This encompasses learning from an over-confident target probability distribution, image ambiguity [1], human annotation errors, multiple classes in an image, and out-of-distribution training examples [2] that can naturally occur due to, for example, random crop data augmentation. We show that our generic noisy label supervision algorithm can address a combination of these issues using a simple and unified approach.

Figure 1: [Histogram of the GT-class log-likelihood on NoisyCIFAR-10 with 50% symmetric noise; legend: "Not transformed" vs. "Transformed"; AUROC: 0.9935.] Averaging predictions over the novel views of a mislabeled training instance effectively mitigates memorization. The model is trained on the noisy training set and then evaluated again on the training examples. The histogram shows the distribution of cross-entropy loss with respect to the GT labels. The red curve corresponds to standard predictions, and the blue curve corresponds to ensembling over transformation views. The right side shows a training example with its original view vs. the transformed novel views; the corresponding losses are marked as "1" and "2" on the histogram.
We propose a noise-robust learning algorithm named Noise-Robust Distillation (NRD) to address the issue of noisy supervision during training. NRD aims to improve the generalization performance of classification models by explicitly considering the noise and ambiguity in the training labels. We motivate NRD by a novel formulation of the noisy supervision learning problem, which we name Neural Vicinal Risk (NVR) minimization.

This stems from the observation that deep neural networks have the inherent capability to detect and correct noisy supervision, even when they are trained with noisy supervision. This ability is particularly evident when considering the vicinal distribution, which represents the distribution generated from perturbed versions of the training data. Despite being trained on noisy labels, neural networks can still accurately model the vicinal distribution, indicating their potential to correct the noisy supervision.

Our findings suggest that the combination of perturbation-based estimation and ensembling can lead to improved model performance, even in the presence of noisy supervision. Building on these insights, we propose Noise-Robust Distillation (NRD), a noise-robust learning method that leverages the neural vicinal risk principle to enhance the generalization performance of classification models trained on noisy labels.

The IJCAI-2024 AISafety Workshop, August 4, 2024, Jeju, South Korea
* Corresponding author.
Email: hyounguk.shon@kaist.ac.kr (H. Shon); seunghee1215@kaist.ac.kr (S. Koh); yhjeon@hanbat.ac.kr (Y. Jeon); junmo.kim@kaist.ac.kr (J. Kim)
ORCID: 0000-0002-0867-1728 (H. Shon); 0009-0006-8662-0834 (S. Koh); 0000-0001-8043-480X (Y. Jeon); 0000-0002-7174-7932 (J. Kim)
© 2024 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

The main contributions of this work are as follows:

    • We introduce Noise-Robust Distillation (NRD), a noise-robust learning approach that comprehensively addresses the challenges posed by noisy supervision during training.
    • NRD is motivated by a novel noise-robust learning framework which we name Neural Vicinal Risk (NVR) minimization. We show that NVR improves the estimation quality of the true class distribution and handles label noise effectively.
    • We demonstrate the ability of neural networks to
Figure 2: (a) Softmax predictions of clean instances (top row) and mislabeled instances (bottom row) from the noisy training set, shown in panels "Model tested with AutoAugment" and "Model tested with RandomCrop". Each marker indicates a softmax vector projected onto a 2D decagon. (b) Calibration plot (Not perturbed: ECE = 0.34; Perturbed: ECE = 0.13). In Figure 2a, model predictions for noisy samples are more sensitive to perturbations of the input. The NoisyCIFAR-10 dataset is used. Markers indicate the softmax scores predicted by the model trained using random crop augmentation. Red markers (+) show predictions generated using the same augmentation policy used during training, and blue markers (∙) are generated using an unseen, stronger augmentation policy. The ten-class softmax scores are visualized by projection onto a decagon using Equiradial Projection [3]. In Figure 2b, while the model itself is heavily miscalibrated (red bars), ensembling the predictions of the perturbed inputs significantly improves the calibration (blue bars).
      detect and correct mislabeled examples through sensitivity to perturbations in the input data, leading to improved model predictions and calibration.
    • We validate the effectiveness of NRD through experiments on benchmark datasets, showing clear improvements in model performance in comparison to standard training methods under noisy supervision.

2. Related works

Noisy label learning. Numerous methods tackle the challenge of training Deep Neural Networks (DNNs) on datasets that contain a mix of correctly labeled and mislabeled samples, as discussed in [4]. Some approaches focus on designing a noise-robust loss to mitigate the impact of mislabeled samples. The Mean Absolute Error (MAE) loss [5] demonstrates competitive performance. Following this, the Generalized Cross-Entropy (GCE) loss, Symmetric Cross-Entropy (SCE) loss, and active passive loss were proposed with improved noise-robustness. The Generalized Jensen-Shannon divergence (GJS) [6] enforces consistency between predictions from multiple augmented views of a sample to regularize training. Also, the principle of negative learning is emphasized by [7, 8]. Strategies inspired by the training dynamics of models [9], such as early stopping [10, 11] or over-parameterization [12], exploit the different convergence speeds of clean and noisy samples. Co-teaching [13] involves simultaneous training of two DNNs, where each network learns from the clean samples chosen by its counterpart. Noise identification aims to filter noisy samples from the training dataset. Noisy samples can be filtered by measuring the degree of disagreement between ensemble models, which arises once the model has overfitted to the noisy samples. Recent algorithms [14, 15, 16] utilize the power of Semi-Supervised Learning (SSL) through a two-step process: first filtering out noisy labels, and then treating the detected noisy samples as unlabeled, thereby reducing the noisy-label problem to an SSL task.

Semi-supervised learning. SSL has emerged as a powerful method for noisy label learning. Among SSL techniques, consistency regularization promotes a model to make consistent outputs across data augmentations, as in the Π-model, Temporal Ensembling [17], and Mean Teacher [18]. Also, FixMatch [19] integrates pseudo-labeling, and virtual adversarial training [20] utilizes adversarial attacks. MixMatch [21], adopted by DivideMix [14], generates pseudo-labels with sharpening for data-augmented unlabeled examples and mixes labeled and unlabeled data using MixUp [22].

Calibration and knowledge distillation. Confidence calibration [23] is the process of adjusting a model's predicted probabilities to better reflect the true likelihood. It has been demonstrated that training a model with data augmentation such as Mixup [22] improves model calibration and robustness to noise [24]. Meanwhile, Knowledge Distillation (KD) [25] enhances a student model by transferring the knowledge contained in the predictions of a teacher model, focusing on "dark" or "hidden" knowledge, including both its confident and less confident predictions.

3. Preliminaries

3.1. Notations

Consider a DNN classification model parameterized by $\theta \in \Theta$ as $f(x, \theta) : \mathcal{X} \mapsto \Delta^{C-1}$, which outputs a probability distribution $P(y|x; \theta)$. The input space is defined as $\mathcal{X} = \mathbb{R}^{H \times W \times C}$, where $H, W, C$ are the height, width, and number of color channels of the image data. $\Delta^{k}$ denotes the $k$-simplex. The model takes an image input $x \in \mathcal{X}$ and predicts a categorical distribution over $\mathcal{Y} = \{1, 2, \ldots, C\}$. We denote an image augmentation operation as $\mathcal{T}(x) : \mathcal{X} \to \mathcal{X}$, and the training dataset as $\mathcal{D} = \{(x_i, y_i)\}_i$. The loss function is defined as $\ell(x, y, \theta) : \mathcal{X} \times \mathcal{Y} \times \Theta \mapsto \mathbb{R}$. $\delta(\cdot)$ is the Dirac delta function and $\mathbb{1}_{\{\cdot\}}$ is the indicator function.

3.2. Empirical Risk

The expected risk $R(\theta)$ is defined as the average loss over $p(x, y)$,

$$R(\theta) = \int_{x,y} \ell(x, y, \theta)\, p(x, y)\, dx\, dy . \quad (1)$$
In practice, a dataset $\mathcal{D}$ is used to mimic the true distribution $p(x, y)$, which leads to the empirical risk

$$\hat{R}(\theta) = \int_{x,y} \ell(x, y, \theta)\, \hat{p}(x, y)\, dx\, dy , \quad (2)$$

where the corresponding empirical distribution $\hat{p}(x, y)$ is a mixture of delta masses at the observed samples, and the class distribution is a one-hot distribution given by the annotations,

$$\hat{p}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}_{\{y = y_i\}}\, \delta(x - x_i) . \quad (3)$$

Our goal is to refine the estimation of the data distribution $p(x, y)$ by utilizing the empirical distribution $\hat{p}(x, y)$. A pivotal question that arises is how to enhance the approximation of the true risk $R(\theta)$ intrinsic to a classification model. As evidenced by Equation (3), this task necessitates the accurate estimation of two orthogonal components of the true distribution $p(x, y) = P(y|x)\, p(x)$: (1) the input distribution $p(x)$ and (2) the corresponding conditional distribution $P(y|x)$.

3.3. Neural Empirical Risk

Estimating $P(y|x)$ as a one-hot distribution involves assigning a single class label per sample, which is vulnerable to human annotation errors. Unfortunately, it proves challenging to enhance or secure accurate supervision signals for $P(y|x)$, as this requires multiple human annotators reviewing the same image [1], which is a prohibitively costly process. Nonetheless, enhancing the estimation quality of the true class distribution $P(y|x)$ can lead to further improvements in estimating and minimizing the true risk.

Neural Empirical Risk (NER). Instead of using Equation (3), we can choose to parameterize $P(y|x)$ by a neural network $P(y|x, \varphi)$ to further improve the estimation quality. First, we factorize the data distribution as $p(x, y) = P(y|x)\, p(x)$, and denote the corresponding empirical distributions as follows:

$$\hat{p}(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i) \quad (4)$$

$$\hat{P}(y|x_i) = \mathbb{1}_{\{y = y_i\}} . \quad (5)$$

Instead of using $\hat{P}(y|x)$, we choose to use a distribution parameterized by a neural network trained on $\mathcal{D}$,

$$P(y|x, \mathcal{D}) = \int_{\varphi} P(y|x, \varphi)\, p(\varphi|\mathcal{D})\, d\varphi , \quad (6)$$

where $p(\varphi|\mathcal{D})$ is the distribution over the function class parameterized by the neural network. By plugging Equation (6) into $\hat{p}(x, y) = \hat{P}(y|x)\, \hat{p}(x)$, we define the neural empirical distribution $\hat{p}_\rho$ and the neural empirical risk $\hat{R}_\rho$ as

$$\hat{p}_\rho(x, y|\mathcal{D}) = P(y|x, \mathcal{D})\, \hat{p}(x) \quad (7)$$

$$\hat{R}_\rho(\varphi) = \int_{x,y} \ell(x, y, \varphi)\, \hat{p}_\rho(x, y|\mathcal{D})\, dx\, dy . \quad (8)$$

Here, we refer to the model $P(y|x, \varphi)$ as the teacher network to distinguish it from the model being trained; the term is borrowed from knowledge distillation. This can provide better estimation quality than $\hat{P}(y|x)$, as is often observed in knowledge distillation, which we view as an instance of NER minimization. Knowledge distillation is known to improve generalization and calibration performance due to the dark knowledge [25].

However, when the model is overfitted to the noisy labels, its ability to estimate the class probabilities is severely degraded. Hence, in order to effectively utilize a neural network, it is necessary to employ a noise-robust method to accurately estimate the class probabilities in the presence of noisy labels.

3.4. Vicinal risk for noise-robust learning

Our motivation is based on the Vicinal Risk Minimization (VRM) principle [26], which is an alternative approximation to $p(x, y)$. The vicinal distribution $p_\nu(\tilde{x}, \tilde{y})$ constructed from the data distribution is defined as

$$p_\nu(\tilde{x}, \tilde{y}) = \int_{x,y} \nu(\tilde{x}, \tilde{y}|x, y)\, p(x, y)\, dx\, dy , \quad (9)$$

where $\nu(\tilde{x}, \tilde{y}|x, y)$ is the vicinity distribution around $(x, y)$. For example, [26] used additive Gaussian noise $\mathcal{N}(0, \sigma^2 I)$. MixUp [24] and CutMix [27] chose stochastic interpolation between samples, which has also shown effectiveness under label noise [24]. Using the dataset, Equation (9) is replaced by the empirical distribution as

$$\hat{p}_\nu(\tilde{x}, \tilde{y}) = \int_{x,y} \nu(\tilde{x}, \tilde{y}|x, y)\, \hat{p}(x, y)\, dx\, dy \quad (10)$$

$$= \frac{1}{n} \sum_{i=1}^{n} \nu(\tilde{x}, \tilde{y}|x_i, y_i) . \quad (11)$$

Neural Vicinal Risk (NVR). We propose to further improve this approximation by using a neural network to robustly approximate the data distribution, modifying Equation (9). We propose the following approximate vicinal data distribution, parameterized by a deep neural network $\varphi$, which we name the neural vicinal distribution $p_\pi$:

$$p_\pi(\tilde{x}, \tilde{y}|\mathcal{D}) = P(\tilde{y}|\tilde{x}; \mathcal{D})\, p(\tilde{x}) \quad (12)$$

$$= \int_{\varphi} P(\tilde{y}|\tilde{x}, \varphi)\, dp(\varphi|\mathcal{D}) \int_{x} \nu(\tilde{x}|x)\, dp(x) \quad (13)$$

$$\approx \int_{\varphi} P(\tilde{y}|\tilde{x}, \varphi)\, d\delta(\varphi - \varphi^*) \int_{x} \nu(\tilde{x}|x)\, d\hat{p}(x) \quad (14)$$

$$= P(\tilde{y}|\tilde{x}, \varphi^*)\, \frac{1}{n} \sum_{i=1}^{n} \nu(\tilde{x}|x_i) \quad (15)$$

$$= \frac{1}{n} \sum_{i=1}^{n} P(\tilde{y}|\tilde{x}, \varphi^*)\, \nu(\tilde{x}|x_i) . \quad (16)$$

Here, $\varphi^* = \arg\min_\varphi \hat{R}(\varphi)$ is the maximum-a-posteriori (MAP) model trained on $\mathcal{D}$. It is important to note that the samples from the vicinal distribution $\nu(\tilde{x}|x_i)$ are not shown during the training of the model $\varphi^*$. Equation (14) is given by substituting the Bayesian model with the MAP model and also replacing the true distribution $p(x)$ with the empirical distribution. The true neural vicinal distribution is approximated by the ensembled MAP model predictions averaged over the samples from the vicinal distribution.

Therefore, we define the empirical neural vicinal distribution $\hat{p}_\pi$ as

$$\hat{p}_\pi(\tilde{x}, \tilde{y}; \varphi^*) = \frac{1}{n} \sum_{i=1}^{n} P(\tilde{y}|\tilde{x}, \varphi^*)\, \nu(\tilde{x}|x_i) . \quad (17)$$
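In essence, Equation (17) estimates the class probabilities by averaging the teacher's predictions over samples drawn from the vicinity of an input. A minimal, self-contained sketch of that ensembling step is shown below; the `teacher_predict` function and the Gaussian vicinity are hypothetical stand-ins for illustration only (the experiments in this paper use image augmentations such as AutoAugment as the vicinity, not additive Gaussian noise):

```python
import math
import random

random.seed(0)

def teacher_predict(x):
    """Toy stand-in for the teacher P(y | x, phi*): a fixed 3-class
    softmax over simple statistics of x (hypothetical, illustrative)."""
    s = sum(x)
    logits = [s, -s, s / len(x)]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def vicinal_class_probs(x, n_views=32, sigma=0.1):
    """Estimate P(y | x) in the spirit of Eq. (17): average the teacher's
    predictions over samples drawn from a Gaussian vicinity nu(x~ | x)."""
    acc = [0.0, 0.0, 0.0]
    for _ in range(n_views):
        x_tilde = [xi + random.gauss(0.0, sigma) for xi in x]
        p = teacher_predict(x_tilde)
        acc = [a + pi / n_views for a, pi in zip(acc, p)]
    return acc

probs = vicinal_class_probs([0.2, -0.1, 0.4, 0.0])
assert abs(sum(probs) - 1.0) < 1e-9 and all(p >= 0 for p in probs)
```

Because each ensemble member is a valid categorical distribution, the average is one as well; the averaging is what smooths out the memorized one-hot response on mislabeled inputs.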
Figure 3: Illustration of the proposed Noise-robust Distillation (NRD) architecture. [Diagram: a teacher-augmentation branch feeding the teacher network through a stop-grad operation, a student-augmentation branch feeding the student network, and an EMA update from the student to the teacher.] $x$ is the input data to the neural network, and $y$ is the assigned target label. Red arrows show the gradient propagation path. During training, the predictions from the original views (student augmentation) are regularized using the predictions generated from unseen views (teacher augmentation). We use an asymmetric augmentation policy so that the teacher augmentation generates novel views, and the stop-gradient operation ensures that the model does not memorize the views generated by the teacher augmentation.



Note that Equation (17) is a parameterized version of Equation (11) using a deep neural network. Finally, the neural vicinal risk is

$$\hat{R}_\pi(\varphi) = \int_{x,y} \ell(x, y, \varphi)\, \hat{p}_\pi(\tilde{x}, \tilde{y}; \varphi^*)\, dx\, dy \,. \tag{18}$$

Note that $\nu(\tilde{x}|x)$ is distinct from the augmentation strategy applied to the model being trained. Similar to Equation (8), we refer to $\nu(\tilde{x}|x)$ as the teacher augmentation.

3.5. Self-correction for memorized instances

We further discuss the behavior of the neural vicinal distribution over a noisy training dataset. Notably, when a training dataset includes mislabeled instances, a teacher neural network can overfit to these noisy labels, and neural empirical risk minimization fails to mitigate the impact. Interestingly, we observe that the neural vicinal distribution exhibits robustness against label noise, effectively self-correcting incoherent labels within the training set.

To understand this phenomenon, we visualize the behavior of the neural vicinal distribution in Figure 2a. Here, we compare the softmax scores from the augmented input samples, distinguishing between the neural empirical distribution (red marker) and the neural vicinal distribution (blue markers). For the transformation policy, the network was trained using random crop augmentation, and AutoAugment [28] is chosen as the vicinal distribution to generate the novel views. The top row shows clean instances and the bottom row shows mislabeled instances.

The visual analysis contrasts the softmax predictions from both seen and novel views of clean and mislabeled training instances. The self-correction of the neural vicinal distribution is instance-dependent, responding differently depending on whether an instance is clean or mislabeled. Notably, while the teacher network's predictions for the novel views tend to shift misclassified predictions towards the ground truth, they remain consistent for clean samples. This suggests that the network outputs corrected predictions by dissociating the novel views from the memorized views.

Next, in Figure 1, we analyzed the label correction behavior of the neural vicinal distribution over the dataset population. Note that the models are trained only on the noisy training set, without access to the ground-truth labels. Applying the transformation (blue curve) significantly reduced the ground-truth class cross-entropy loss compared to no transformation (red curve), and we observed a good separation between the two distributions. Also, Table 1 shows the ground-truth accuracy for the training samples, where we observed significant improvements for the mislabeled instances when the transformation is applied.

Table 1
Label correction behavior for memorized training examples using transformed views. The models are trained to perfectly memorize the noisy labels, then evaluated again on the training set with ground-truth labels. Due to memorization, the GT accuracy for mislabeled instances is zero and the overall accuracy is bounded by the noise rate. However, averaging the predictions from the transformed inputs shifts the prediction of the noisy examples to the ground truth. For the transformation, AutoAugment followed by RandomErasing was used. For the dataset, NoisyCIFAR-10 with symmetric noise was used.

                 Training accuracy for GT labels (%)
    η     Transform    Clean    Mislabeled    Overall
   20%        ×        99.99       0.01        81.93
   20%        ○        96.01      54.72        88.55
   50%        ×        99.97       0.07        55.11
   50%        ○        93.09      40.58        69.51
   80%        ×        99.83       0.06        28.10
   80%        ○        77.29      14.82        32.38

We additionally observed that ensembling perturbed predictions enhances calibration, as depicted in Figure 2b. While the original model is heavily over-confident due to overfitting (red), the vicinal prediction improves accuracy and reflects class ambiguities (blue).


4. Method

Motivated by the observation in Section 3.5, we propose a novel learning method for noisy labels named Noise-robust Distillation (NRD). Our method is formulated as a simple
loss function, which makes it easy to employ in existing training pipelines.

For this, we combine the target loss with the neural vicinal risk loss as a regularization objective. We formulate the combined objectives into a triplet loss. We have found the Jensen-Shannon divergence (JSD), which generalizes to a triplet loss, to be effective. The JSD for three distributions is

$$\mathrm{JSD}_\pi(\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3) = \sum_i \pi_i D_{\mathrm{KL}}(\mathbf{p}_i \,\|\, \mathbf{m}) \,, \tag{19}$$

where $\mathbf{m} = \sum_i \pi_i \mathbf{p}_i$. The hyperparameter $\pi \in \Delta^2$ is chosen to balance the importance weights between the distributions. Additionally, the JS divergence is known to have a nice robustness property against label noise: [6] showed that the JS divergence simulates the MAE loss [5] in its asymptote.

Algorithm 1: PyTorch-style pseudocode

    ema_model = ema(model)
    optimizer = sgd_optimizer(model)

    for x, y in dataloader:
        x_t = teacher_aug(x)
        x_s = student_aug(x)

        # disconnect from backprop
        y_t = ema_model(x_t).detach()
        y_s = model(x_s)

        # distance between predictions
        loss = js_div(y, y_s, y_t)
        loss.backward()
        optimizer.step()
        ema_model.update()
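The structure of Algorithm 1 can be exercised as a self-contained toy run. In the sketch below (a NumPy illustration, not the paper's implementation), a linear softmax model stands in for the network and finite-difference gradients stand in for autograd; all names are hypothetical. It follows the same loop: a JSD triplet loss between the target, the student view, and the EMA teacher's vicinal prediction, followed by an EMA update of the teacher.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def jsd(dists, weights=(1/3, 1/3, 1/3), eps=1e-12):
    # Generalized Jensen-Shannon divergence: sum_i pi_i * KL(p_i || m),
    # with m the pi-weighted mixture of the distributions (Eq. 19).
    m = sum(w * p for w, p in zip(weights, dists))
    return float(sum(w * np.sum((p + eps) * np.log((p + eps) / (m + eps)))
                     for w, p in zip(weights, dists)))

def nrd_loss(theta, theta_bar, x, views, y):
    # Triplet loss between target y, student prediction y_s, and teacher
    # prediction y_t averaged over vicinal views (Eqs. 25-27).
    y_s = softmax(x @ theta)
    y_t = softmax(views @ theta_bar).mean(axis=0)  # no gradient flows here
    return jsd([y, y_s, y_t])

def num_grad(f, theta, h=1e-5):
    # Finite-difference gradient; a real implementation would use autograd
    # with y_t detached from the graph (the stop-gradient in Figure 3).
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        e = np.zeros_like(theta)
        e[idx] = h
        g[idx] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

# Toy single-example run: a linear softmax student on 4-dim inputs, 3 classes.
theta = 0.1 * rng.normal(size=(4, 3))
theta_bar = theta.copy()                        # EMA teacher weights
x = rng.normal(size=4)
y = np.array([1.0, 0.0, 0.0])                   # (possibly noisy) target label
teacher_aug = lambda x: x + 0.1 * rng.normal(size=x.shape)
lr, beta = 0.5, 0.99

loss_start = nrd_loss(theta, theta_bar, x, x[None], y)
for step in range(200):
    views = teacher_aug(x)[None]                # one vicinal sample per step
    g = num_grad(lambda t: nrd_loss(t, theta_bar, x, views, y), theta)
    theta = theta - lr * g                      # optimizer.step()
    theta_bar = beta * theta_bar + (1 - beta) * theta  # ema_model.update(), Eq. 24
loss_end = nrd_loss(theta, theta_bar, x, x[None], y)
print(loss_start, loss_end)  # the triplet loss decreases over the run
```

With a real autograd framework (e.g. PyTorch), the finite-difference step collapses to `loss.backward()` with the teacher output detached, exactly as in Algorithm 1.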
Next, we derive our NRD objective step by step. By applying NVR to the JSD loss, we have

$$\mathcal{L}(\theta; x, \mathbf{y}, \varphi) = \mathrm{JSD}_\pi(\mathbf{y}, \mathbf{y}_s, \mathbf{y}_t) \,, \tag{20}$$
$$\mathbf{y}_s = f(x, \theta) \,, \tag{21}$$
$$\mathbf{y}_t = \mathbb{E}_{\nu(\tilde{x}|x)}\left[f(\tilde{x}, \varphi)\right] \,, \tag{22}$$

assuming that we have a trained teacher network $\varphi$. Here, $\mathbf{y}$ is the target label, $\mathbf{y}_s$ is the model output, and $\mathbf{y}_t$ is the teacher network output. The loss is solved for $\min_\theta \mathcal{L}(\mathbf{y}, \mathbf{y}_s, \mathbf{y}_t)$.

To improve noise-robustness, we can further employ an iterative distillation scheme in which we repeat the strategy for multiple rounds of training. We set the teacher network as the model obtained from the previous training round, such that $\varphi_t = \theta_{t-1}$ at the $t$-th training round. Applying this to Equation (20),

$$\theta_t = \arg\min_\theta \mathcal{L}(\theta; x, \mathbf{y}, \theta_{t-1}) \,. \tag{23}$$

The student network obtained from the previous training round is switched to the teacher role for the next round. However, in practice, we found this to be unstable and difficult to converge. Instead, we take the exponential average of the historical models as the teacher and set $\varphi_t = \bar{\theta}_{t-1}$:

$$\bar{\theta}_t = \beta \cdot \bar{\theta}_{t-1} + (1 - \beta) \cdot \theta_t \,. \tag{24}$$

For the decay rate, we simply set $\beta = 0.99$ for all experiments. The aggregation reduces the variance of the neural vicinal risk estimation caused by the stochastic gradient, and we have empirically found that it effectively stabilizes the training and leads to faster convergence.

Finally, we formally define our NRD training objective. To reduce the training cost, we simplify each training round into a single step of stochastic gradient descent (SGD). This simplifies the algorithm from a multi-staged process into a single-staged process, and significantly accelerates the training. The NRD objective is

$$\mathcal{L}_{\mathrm{NRD}}(\theta; x, \mathbf{y}, \bar{\theta}) = \mathrm{JSD}_\pi(\mathbf{y}, \mathbf{y}_s, \mathbf{y}_t) \,, \tag{25}$$
$$\mathbf{y}_s = f(x, \theta) \,, \tag{26}$$
$$\mathbf{y}_t = \mathbb{E}_{\nu(\tilde{x}|x)}\left[f(\tilde{x}, \bar{\theta})\right] \,, \tag{27}$$

with a slight abuse of notation for $\bar{\theta}$, which is not an optimization variable but is continuously updated after each SGD step. This is implemented by detaching $\mathbf{y}_t$ from the backpropagation graph, which prevents the model from memorizing the teacher augmentation views (stop-grad in Figure 3). For Equation (27), we found a single sample per SGD step to be sufficient. The overall architecture is illustrated in Figure 3 and the pseudocode is presented in Algorithm 1.


5. Experiments

5.1. Experimental settings

Benchmarking datasets. For synthetic label noise benchmarks, we used NoisyCIFAR-10 and NoisyCIFAR-100 [29]. For symmetric label noise, we randomly flip the ground-truth label with a probability $\eta$ uniformly across all categories. For asymmetric label noise, we follow the scheme in [30]. For NoisyCIFAR-10-asymm, we flip truck→automobile, bird→airplane, cat→dog, dog→cat, deer→horse. For NoisyCIFAR-100-asymm, within each superclass, we randomly replace a subclass label $y_i$ with the adjacent subclass $y_i + 1$ with probability $\eta$.

For the real-world benchmark, we used the WebVision [31] dataset. WebVision consists of 2.4M training examples collected via Google and Flickr image search. We used a miniaturized training set following [32], which uses only the first 50 categories in the "Google" image set. Mini-WebVision consists of 66K training and 2.5K validation examples. We additionally evaluated the trained model on the ImageNet [33] validation set. The noise rate is known to be around 20%.

Baseline methods. For the CIFAR benchmarks, we compare against cross-entropy (CE), bootstrapping (BS) [34], label smoothing (LS) [35], symmetric cross-entropy (SCE) [36], generalized cross-entropy (GCE) [37], normalized loss (NCE+RCE) [38], and Jensen-Shannon divergence (JS, GJS) [6]. For the WebVision benchmarks, we compared our method with state-of-the-art methods including ELR+ [10], DivideMix [14], and GJS [6]. The baseline results were adopted from [6].

Models. The PreActResNet-34 architecture [39] is used for all experiments conducted on the CIFAR-10/100 datasets. For the WebVision experiments, we used ResNet-50. All experiments were trained from random initialization.

Augmentation policy. For the CIFAR experiments, we followed [6] and used RandAugment [40] chained with Cutout [41] for all methods. For the NRD teacher transformation, we used AugMix [42] in all experiments.

Hyperparameters. For the CIFAR-10/100 benchmarks, we used 400 epochs for each training. We used the SGD optimizer with momentum 0.9 and weight decay of $10^{-4}$. Learning rates
     Table 2
     Noisy label performance on synthetic noisy label benchmarks. We used NoisyCIFAR-10 and NoisyCIFAR-100 datasets. Values
     indicate clean test accuracy. All values are averaged over five independent runs. The best and second best results are highlighted
     in bold.

                                               NoisyCIFAR-10                                                                    NoisyCIFAR-100
                                     Symmetric                        Asymmetric                                       Symmetric                              Asymmetric
       Noise rate      20%          40%    60%              80%       20%   40%                          20%          40%    60%                80%           20%   40%
         CE           91.63         87.74       81.99       66.51    92.77   87.12                   65.74           55.77       44.42         10.74          66.85      49.45
         BS           91.68         89.23       82.65       16.97    93.06   88.87                   72.92           68.52       53.80         13.83          73.79      64.67
         LS           93.51         89.90       83.96       67.35    92.94   88.10                   74.88           68.41       54.58         26.98          73.17      57.20
        SCE           94.29         92.72       89.26       80.68    93.48   84.98                   74.21           68.23       59.28         26.80          70.86      51.12
        GCE           94.24         92.82       89.37       79.19    92.83   87.00                   75.02           71.54       65.21         49.68          72.13      51.50
      NCE+RCE         94.27         92.03       87.30       77.89    93.87   86.83                   72.39           68.79       62.18         31.63          71.35      57.80
         JS           94.52         93.01       89.64       76.06    92.18   87.99                   75.41           71.12       64.36         45.05          71.70      49.36
        GJS           95.33         93.57       91.64       79.11    93.94   89.65                   78.05           75.71       70.15         44.49          74.60      63.70
      NRD (ours)      95.43         94.65       92.45       85.32    93.90   91.25                   78.54           76.29       72.43         60.01          76.07      61.40
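The label-noise injection described in Section 5.1 can be sketched as follows. This is a NumPy illustration with hypothetical helper names; the exact symmetric-flip convention (resampling uniformly over all classes, so the effective corruption rate is $\eta(C-1)/C$) is an assumption, as some papers instead exclude the original class.

```python
import numpy as np

def symmetric_noise(labels, num_classes, eta, rng):
    # With probability eta, resample the label uniformly over all classes.
    # Effective flip rate is eta * (C - 1) / C under this convention.
    labels = np.asarray(labels).copy()
    flip = rng.random(labels.shape[0]) < eta
    labels[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
    return labels

def asymmetric_noise_cifar100(labels, eta, rng, classes_per_super=5):
    # NoisyCIFAR-100-asymm: within each superclass, flip y_i to the
    # adjacent subclass y_i + 1 (cyclically) with probability eta.
    labels = np.asarray(labels).copy()
    flip = rng.random(labels.shape[0]) < eta
    base = (labels // classes_per_super) * classes_per_super
    labels[flip] = base[flip] + (labels[flip] - base[flip] + 1) % classes_per_super
    return labels

rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=10_000)
noisy = symmetric_noise(clean, num_classes=10, eta=0.5, rng=rng)
print((noisy != clean).mean())  # roughly 0.45 = eta * (C - 1) / C
```

Note that the asymmetric variant never leaves the superclass, which is what makes it a harder, structure-preserving corruption.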



were reduced by a factor of 0.1 after the 200th and 300th epochs. For the WebVision benchmarks, we trained the network for 300 epochs. The learning rate was reduced by a factor of 0.1 after 150 and 250 epochs. Refer to Appendix A for hyperparameter configuration details.

Table 3
Real-world noisy label benchmark on WebVision. The models are trained using the WebVision training set, and evaluated on the WebVision and ImageNet validation sets. The values indicate accuracy. IRNv2 and RN50 indicate Inception-ResNet-V2 and ResNet-50, respectively. $N$ indicates the number of networks used.

                                             WebVision        ImageNet
   Method       Arch.    Aug.          N    Top-1   Top-5   Top-1   Top-5
   ELR+         IRNv2                       77.78   91.68   70.29   89.76
   DivideMix    IRNv2    MixUp         2    77.32   91.64   75.20   90.84
   DivideMix    RN50                        76.32   90.65   74.42   91.21
   CE           RN50     ColorJitter   1    70.69   88.64   67.32   88.00
   JS           RN50     ColorJitter   1    74.56   91.09   70.36   90.60
   GJS          RN50     ColorJitter   1    77.99   90.62   74.33   90.33
   NRD (ours)   RN50     ColorJitter   1    78.56   92.48   75.24   92.36

Table 4
Performance on clean CIFAR-10 and CIFAR-100 datasets. The values indicate test accuracy.

   Method       CIFAR-10   CIFAR-100
   CE             94.35      77.60
   GCE            94.00      77.65
   GJS            94.78      79.27
   NRD (ours)     95.05      79.61


5.2. Results

Performance on noisy label benchmarks. In Table 2, we show the performance of our method in comparison to robust loss functions. While most of the baselines show inconsistent performance between symmetric and asymmetric noise types, our method shows consistent improvement across a wide range of noise rates and noise types. Notably, we significantly improve performance under high noise rate settings where GJS tends to underperform. For NoisyCIFAR-10 with 80% noise, we improve by 5%p over SCE, and for NoisyCIFAR-100 with 80% noise, we improve by 10%p over GCE.

Furthermore, the results on the large-scale real-world noisy label benchmark are shown in Table 3. Notably, we observed that our method outperforms existing methods that use two networks.

Performance on clean datasets. The proposed method improves model generalization when applied to clean dataset training, as seen in Table 4. This is because the training dataset contains visually ambiguous images that make it difficult to draw a clear decision boundary, and therefore the hard target distributions from the annotations serve as a type of noisy supervision signal. We show that applying NRD can regularize and improve the performance of the model.

Figure 4: Comparison of overfitting behavior in consistency regularization (GJS [6]). Enforcing consistency does not fully prevent overfitting because the model memorizes the noisy labels after an extended number of epochs. In contrast, our method (NRD) effectively prevents memorization. The NoisyCIFAR-100 dataset is used. (Curves: test accuracy vs. epoch for GJS and NRD at noise rates $\eta \in \{0.2, 0.4, 0.6, 0.8\}$.)

Comparison to consistency regularization. The consistency regularization used in GJS is a powerful technique for noise-robustness. While it is similar to NRD, it does not directly prevent memorization of noisy labels. Figure 4 shows that GJS suffers from overfitting when trained for an extended number of steps, as indicated by the test accuracy decreasing after reaching a peak at an early epoch. In contrast, NRD significantly mitigates overfitting. Notably, in the 80% noise rate setting, we improve over GJS by 36%p. The key contributing factor is that our method uses a stop-gradient, which directly prevents the model from memorizing the views generated by the asymmetric augmentation policy.
Figure 5: Comparison of the CE baseline and NRD models trained on NoisyCIFAR-10 with symmetric label noise, where $\eta$ denotes the noise rate. Expected calibration error (ECE) is measured. NRD training consistently improves calibration across a range of noise rates. (a) $\eta = 0.2$: CE, ECE = 0.08; CE+NRD, ECE = 0.09. (b) $\eta = 0.5$: CE, ECE = 0.31; CE+NRD, ECE = 0.13. (c) $\eta = 0.8$: CE, ECE = 0.63; CE+NRD, ECE = 0.13.

Confidence calibration. In Figure 5, we additionally evaluated the calibration performance. We observed that the regularization effect from NRD also improves the calibration of the model. Our method shows consistent calibration performance across all noise rates, which aligns with the performance of our method.


6. Conclusion

Our work proposes Noise-Robust Distillation (NRD), a simple regularization objective designed to address a wide range of noisy supervision problems in training. We motivate our method based on the novel formulation of Neural Vicinal Risk (NVR) minimization, which focuses on leveraging deep neural networks to improve empirical risk minimization under noisy supervision scenarios. A key insight of our work is the inherent capacity of deep neural networks to detect and correct mislabeled examples based on the vicinal distribution, a feature we exploit to improve model predictions and calibration. We have validated our method on several noisy label learning benchmarks. The results show clear improvements in performance compared to the baselines under noisy supervision. These findings suggest that NRD offers an effective strategy for handling noisy supervision, leading to enhanced generalization performance of classification models.

Acknowledgement. This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00240379).

     IEEE Transactions on Neural Networks and Learning Systems (2022).
 [5] A. Ghosh, H. Kumar, P. S. Sastry, Robust loss functions under label noise for deep neural networks, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, 2017.
 [6] E. Englesson, H. Azizpour, Generalized Jensen-Shannon divergence loss for learning with noisy labels, Advances in Neural Information Processing Systems 34 (2021) 30284–30297.
 [7] Y. Kim, J. Yim, J. Yun, J. Kim, NLNL: Negative learning for noisy labels, in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
 [8] Y. Kim, J. Yun, H. Shon, J. Kim, Joint negative and positive learning for noisy labels, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 9442–9451.
 [9] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, S. Lacoste-Julien, A closer look at memorization in deep networks, in: Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, PMLR, 2017, pp. 233–242.
[10] S. Liu, J. Niles-Weed, N. Razavian, C. Fernandez-Granda, Early-learning regularization prevents memorization of noisy labels, in: Advances in Neural Information Processing Systems, volume 33, 2020, pp. 20331–20342.
[11] Y. Bai, E. Yang, B. Han, Y. Yang, J. Li, Y. Mao, G. Niu, T. Liu, Understanding and improving early stopping for learning with noisy labels, in: Advances in Neural Information Processing Systems, volume 34, 2021, pp. 24392–24403.
[12] S. Liu, Z. Zhu, Q. Qu, C. You, Robust training under label noise by over-parameterization, in: Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, PMLR, 2022, pp. 14153–14172.
[13] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, M. Sugiyama, Co-teaching: Robust training of deep neural networks with extremely noisy labels, in: Advances in Neural Information Processing Systems, vol-
                                                                                                                                                                    ume 31, 2018.
                                                                                                                                                               [14] J. Li, R. Socher, S. C. Hoi, Dividemix: Learning with
References                                                                                                                                                          noisy labels as semi-supervised learning, in: Interna-
                                                                                                                                                                    tional Conference on Learning Representations, 2020.
  [1] L. Schmarje, V. Grossmann, C. Zelenka, S. Dippel,                                                                                                        [15] J. Li, G. Li, F. Liu, Y. Yu, Neighborhood collective
      R. Kiko, M. Oszust, M. Pastell, J. Stracke, A. Valros,                                                                                                        estimation for noisy label identification and correction,
      N. Volkmann, et al., Is one annotation enough?-a data-                                                                                                        in: European Conference on Computer Vision, 2022.
      centric image classification benchmark for noisy and                                                                                                     [16] G. Zhao, G. Li, Y. Qin, F. Liu, Y. Yu, Centrality and
      ambiguous label estimation, in: Thirty-sixth Con-                                                                                                             consistency: Two-stage clean samples identification
      ference on Neural Information Processing Systems                                                                                                              for learning with instance-dependent noisy labels, in:
      Datasets and Benchmarks Track, 2022.                                                                                                                          European Conference on Computer Vision, volume
  [2] S. Yun, S. J. Oh, B. Heo, D. Han, J. Choe, S. Chun, Re-                                                                                                       13685, 2022, pp. 21–37.
      labeling imagenet: from single to multi-labels, from                                                                                                     [17] S. Laine, T. Aila, Temporal ensembling for semi-
      global to localized labels, in: Proceedings of the                                                                                                            supervised learning, in: International Conference
      IEEE/CVF Conference on Computer Vision and Pat-                                                                                                               on Learning Representations, 2017. URL: https://
      tern Recognition, 2021, pp. 2340–2350.                                                                                                                        openreview.net/forum?id=BJ6oOfqge.
  [3] C.    Lehman,        Visualizing     softmax,     2019.                                                                                                  [18] A. Tarvainen, H. Valpola, Mean teachers are better role
      URL:             https://charlielehman.github.io/post/                                                                                                        models: Weight-averaged consistency targets improve
      visualizing-tempscaling/.                                                                                                                                     semi-supervised deep learning results, in: Advances
  [4] H. Song, M. Kim, D. Park, Y. Shin, J.-G. Lee, Learning                                                                                                        in Neural Information Processing Systems, volume 30,
      from noisy labels with deep neural networks: A survey,                                                                                                        2017.
[19] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang,           Conference on Machine Learning, PMLR, 2020, pp.
     C. A. Raffel, E. D. Cubuk, A. Kurakin, C.-L. Li, Fix-            6448–6458.
     match: Simplifying semi-supervised learning with con-       [36] Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, J. Bailey, Sym-
     sistency and confidence, in: Advances in Neural In-              metric cross entropy for robust learning with noisy
     formation Processing Systems, volume 33, 2020, pp.               labels, in: Proceedings of the IEEE/CVF International
     596–608.                                                         Conference on Computer Vision (ICCV), 2019.
[20] T. Miyato, A. M. Dai, I. Goodfellow, Adversarial train-     [37] Z. Zhang, M. Sabuncu, Generalized cross entropy loss
     ing methods for semi-supervised text classification,             for training deep neural networks with noisy labels, in:
     in: International Conference on Learning Representa-             Advances in Neural Information Processing Systems,
     tions, 2017.                                                     volume 31, 2018.
[21] D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot,       [38] X. Ma, H. Huang, Y. Wang, S. Romano, S. Erfani, J. Bai-
     A. Oliver, C. A. Raffel, Mixmatch: A holistic approach           ley, Normalized loss functions for deep learning with
     to semi-supervised learning, in: Advances in Neural              noisy labels, in: Proceedings of the 37th International
     Information Processing Systems, volume 32, 2019.                 Conference on Machine Learning, volume 119 of Pro-
[22] H. Zhang, M. Cisse, Y. N. Dauphin, D. Lopez-Paz,                 ceedings of Machine Learning Research, PMLR, 2020,
     mixup: Beyond empirical risk minimization, in: In-               pp. 6543–6553.
     ternational Conference on Learning Representations,         [39] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning
     2018.                                                            for image recognition, in: CVPR, 2016.
[23] C. Guo, G. Pleiss, Y. Sun, K. Q. Weinberger, On calibra-    [40] E. D. Cubuk, B. Zoph, J. Shlens, Q. V. Le, Randaug-
     tion of modern neural networks, in: International con-           ment: Practical automated data augmentation with a
     ference on machine learning, PMLR, 2017, pp. 1321–               reduced search space. 2020 ieee, in: CVF Conference
     1330.                                                            on Computer Vision and Pattern Recognition Work-
[24] S. Thulasidasan, G. Chennupati, J. A. Bilmes, T. Bhat-           shops (CVPRW), 2019, pp. 3008–3017.
     tacharya, S. Michalak, On mixup training: Improved          [41] T. DeVries, G. W. Taylor, Improved regularization of
     calibration and predictive uncertainty for deep neu-             convolutional neural networks with cutout, arXiv
     ral networks, in: Advances in Neural Information                 preprint arXiv:1708.04552 (2017).
     Processing Systems, volume 32, 2019.                        [42] D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer,
[25] G. Hinton, O. Vinyals, J. Dean, Distilling the knowl-            B. Lakshminarayanan, AugMix: A simple data pro-
     edge in a neural network, 2015. arXiv:1503.02531.                cessing method to improve robustness and uncertainty,
[26] O. Chapelle, J. Weston, L. Bottou, V. Vapnik, Vicinal            Proceedings of the International Conference on Learn-
     risk minimization, Advances in neural information                ing Representations (ICLR) (2020).
     processing systems 13 (2000).                               [43] K. He, X. Zhang, S. Ren, J. Sun, Identity mappings in
[27] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, Y. Yoo, Cut-         deep residual networks, in: Computer Vision–ECCV
     mix: Regularization strategy to train strong classi-             2016: 14th European Conference, Amsterdam, The
     fiers with localizable features, in: Proceedings of the          Netherlands, October 11–14, 2016, Proceedings, Part
     IEEE/CVF international conference on computer vi-                IV 14, Springer, 2016, pp. 630–645.
     sion, 2019, pp. 6023–6032.                                  [44] T. maintainers, contributors, Torchvision: Pytorch’s
[28] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, Q. V. Le,           computer vision library, https://github.com/pytorch/
     Autoaugment: Learning augmentation strategies from               vision, 2016.
     data, in: Proceedings of the IEEE/CVF Conference
     on Computer Vision and Pattern Recognition (CVPR),
     2019.
[29] A. Krizhevsky, Learning multiple layers of features
                                                                 A. Detailed hyperparameter
     from tiny images, Technical Report, 2009.                      configurations
[30] G. Patrini, A. Rozza, A. Menon, R. Nock, L. Qu, Making
     neural networks robust to label noise: a loss correction    A.1. CIFAR-10/100 benchmarks
     approach, stat 1050 (2016) 13.
                                                                 General training details For the network architecture,
[31] W. Li, L. Wang, W. Li, E. Agustsson, L. Van Gool, We-
                                                                 we use PreActResNet-34 [43]. For training, we use SGD
     bvision database: Visual learning and understanding
                                                                 optimizer with momentum 0.9, a batch size of 128, and train
     from web data, arXiv preprint arXiv:1708.02862 (2017).
                                                                 for 400 epochs. The learning rate is reduced by 1/10 at 50%
[32] P. Chen, B. B. Liao, G. Chen, S. Zhang, Understand-
                                                                 and 75% of the training iterations.
     ing and utilizing deep neural networks trained with
                                                                 Augmentation policy For data augmentation, we use Ran-
     noisy labels, in: International Conference on Machine
                                                                 dAugment [40] with 𝑁 = 1, 𝑀 = 3 followed by random
     Learning, PMLR, 2019, pp. 1062–1070.
                                                                 crop (size 32 and 4-pixel padding), random horizontal flip
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei,
                                                                 and Cutout [41] with length 5.
     Imagenet: A large-scale hierarchical image database,
                                                                 Hyperparameters See Table 5 for the details. For the base-
     in: Computer Vision and Pattern Recognition, 2009.
                                                                 lines, we follow the same hyperparameter configurations
     CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 248–
                                                                 used by [6]. 40% noise rate setting was used to find the best
     255.
                                                                 learning rates and weight decay rates. For the learning rates
[34] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Er-
                                                                 and weight decay rates for NRD, we used the same configura-
     han, A. Rabinovich, Training deep neural networks
                                                                 tions as GJS. For the tuning of hyperparameters {𝜋1 , 𝜋2 , 𝜋3 }
     on noisy labels with bootstrapping, arXiv preprint
                                                                 in the NRD loss, we fixed 𝜋1 = 𝜋3 so that the targets y and
     arXiv:1412.6596 (2014).
                                                                 yt have equal weight. We tuned 𝜋2 ∈ {0.1, 0.2, ..., 0.9}.
[35] M. Lukasik, S. Bhojanapalli, A. Menon, S. Kumar, Does
                                                                 For the moving average decay rate, we used 𝛽 = 0.99 for
     label smoothing mitigate label noise?, in: International
     Table 5
     Hyperparameters for CIFAR-10/100. The hyperparameters for the baseline methods are identical to [6]. For the learning
     rate and weight decay, each entry denotes [LR, WD]. For the method-specific hyperparameters, each entry denotes its
     hyperparameters: BS (𝛽 ), LS (𝜖), SCE ([𝛼, 𝛽]), GCE (𝑞 ), NCE+RCE ([𝛼, 𝛽]), JS (𝜋1 ), GJS (𝜋1 ), NRD ([𝜋1 , 𝜋2 , 𝜋3 ]).

                       Learning Rate & Weight Decay                                                  Method-specific Hyperparameters
 Dataset     Method
                       Sym Noise       Asym Noise      No Noise                                        Sym Noise                                                Asym Noise
                         20-80%          20-40%            0%                20%               40%                60%                 80%                 20%                 40%
             CE         [0.05, 1e-3]    [0.1, 1e-3]          -                 -                 -                  -                   -                   -                   -
             BS          [0.1, 1e-3]    [0.1, 1e-3]         0.5               0.5               0.7                0.7                 0.9                 0.7                 0.5
             LS          [0.1, 5e-4]    [0.1, 1e-3]         0.1               0.5               0.9                0.7                 0.1                 0.1                 0.1
             SCE        [0.01, 5e-4]   [0.05, 1e-3]     [0.2, 0.1]       [0.05, 0.1]        [0.1, 0.1]         [0.2, 1.0]           [0.1,1.0]          [0.1, 0.1]          [0.2, 1.0]
 CIFAR-10
             GCE        [0.01, 5e-4]    [0.1, 1e-3]         0.5               0.7               0.7                0.7                 0.9                 0.1                 0.1
             NCE+RCE   [0.005, 1e-3]   [0.05, 1e-4]     [10, 0.1]         [10, 0.1]         [10, 0.1]          [1.0, 0.1]           [10,1.0]           [10, 0.1]           [1.0, 0.1]
             JS         [0.01, 5e-4]    [0.1, 1e-3]         0.1               0.7               0.7                0.9                 0.9                 0.3                 0.3
             GJS         [0.1, 5e-4]    [0.1, 1e-3]         0.5               0.3               0.9                0.1                 0.1                 0.3                 0.3
             NRD         [0.1, 5e-4]    [0.1, 1e-3]   [0.2, 0.6, 0.2]   [0.2, 0.6, 0.2]   [0.2, 0.6, 0.2]   [0.25, 0.5, 0.25]   [0.25, 0.5, 0.25]    [0.1, 0.8, 0.1]    [0.15, 0.7, 0.15]
             CE         [0.4, 1e-4]    [0.2, 1e-4]           -                 -                 -                  -                   -                   -                   -
             BS         [0.4, 1e-4]    [0.4, 1e-4]          0.7               0.5               0.5                0.5                 0.9                 0.3                 0.3
             LS         [0.2, 5e-5]    [0.4, 1e-4]          0.1               0.7               0.7                0.7                 0.9                 0.5                 0.7
             SCE        [0.2, 1e-4]    [0.4, 5e-5]      [0.1, 0.1]        [0.1, 0.1]        [0.1, 0.1]         [0.1, 1.0]           [0.1,0.1]          [0.1, 1.0]          [0.1, 1.0]
 CIFAR-100
             GCE        [0.4, 1e-5]    [0.2, 1e-4]          0.5               0.5               0.5                0.7                 0.7                 0.7                 0.7
             NCE+RCE    [0.2, 5e-5]    [0.2, 5e-5]      [20, 0.1]         [20, 0.1]         [20, 0.1]          [20, 0.1]            [20,0.1]            [20, 0.1]          [10, 0.1]
             JS         [0.2, 1e-4]    [0.1, 1e-4]          0.1               0.1               0.3                0.5                 0.3                 0.5                 0.5
             GJS        [0.2, 5e-5]    [0.4, 1e-4]          0.3               0.3               0.5                0.9                 0.1                 0.5                 0.1
             NRD        [0.2, 5e-5]    [0.4, 1e-4]    [0.2, 0.6, 0.2]   [0.2, 0.6, 0.2]   [0.2, 0.6, 0.2]    [0.2, 0.6, 0.2]    [0.15, 0.7, 0.15]   [0.25, 0.5, 0.25]    [0.4, 0.2, 0.4]




all experiments.
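The Cutout operation used in the augmentation policy above is straightforward to reproduce. The following is a minimal NumPy sketch, not the paper's implementation: the function name `cutout` and the uniform choice of patch center are our own illustrative assumptions, following the formulation of DeVries & Taylor [41], where the square patch is clipped at the image border.

```python
import numpy as np

def cutout(img: np.ndarray, length: int = 5, rng=None) -> np.ndarray:
    """Zero out a square patch of side `length` at a random center.

    `img` is an HxWxC array. The patch center is sampled uniformly and
    the patch is clipped at the image border, so corner patches shrink.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape[:2]
    cy, cx = int(rng.integers(0, h)), int(rng.integers(0, w))  # patch center
    y1, y2 = max(0, cy - length // 2), min(h, cy + (length + 1) // 2)
    x1, x2 = max(0, cx - length // 2), min(w, cx + (length + 1) // 2)
    out = img.copy()
    out[y1:y2, x1:x2] = 0.0  # mask the patch with zeros
    return out
```

With length 5 on a 32×32 image, at most 25 pixels are masked per call, matching the configuration used for the CIFAR benchmarks.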

A.2. WebVision benchmark
General training details For the network architecture, we use ResNet-50 with random initialization. For training, we use the SGD optimizer with momentum 0.9 and a batch size of 64, and train for 300 epochs. The initial learning rate was set to 0.1 and reduced by 1/10 after the 100th and 200th epochs.
Augmentation policy For data augmentation, we use random resized crop with size 224, random horizontal flip, and color jitter. We used the color jitter implementation from TorchVision [44] with brightness=0.4, contrast=0.4, saturation=0.4, hue=0.2. For the NRD teacher augmentation, we use AugMix [42] followed by random resized crop with size 224 and random horizontal flip.
Hyperparameters For the hyperparameters {𝜋1 , 𝜋2 , 𝜋3 }
in the NRD loss, we used 𝜋1 = 𝜋3 = 0.1 and 𝜋2 = 0.8.
The moving average decay rate was set to 𝛽 = 0.99.
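For reference, the exponential moving average with decay rate 𝛽 that maintains the teacher can be sketched as follows. This is a minimal illustration with plain Python floats rather than the paper's code; the names `teacher_params` and `student_params` are our own, and in practice the entries would be the networks' weight tensors.

```python
def ema_update(teacher_params: dict, student_params: dict, beta: float = 0.99) -> dict:
    """In-place EMA step: teacher <- beta * teacher + (1 - beta) * student.

    Both arguments are parallel dicts mapping parameter names to values;
    a larger beta makes the teacher change more slowly.
    """
    for name, s in student_params.items():
        teacher_params[name] = beta * teacher_params[name] + (1.0 - beta) * s
    return teacher_params
```

With 𝛽 = 0.99, the teacher moves 1% of the way toward the student's current weights at each update.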


B. Training dynamics visualization
   of perturbed inputs
In this section, we provide visualized trajectories of the model's predictions on perturbed inputs throughout training (see Table 6). These are the same plots as presented in Figure 2a, albeit at different mid-training epochs. The model is trained with a standard training scheme using the cross-entropy loss on the NoisyCIFAR-10-symm-40% dataset. We observe that a significant portion of the predictions on inputs perturbed with an augmentation unseen during training (AutoAugment) gradually settles to the ground-truth class, whereas the predictions on inputs perturbed with the same augmentation policy used at training (RandomCrop) eventually converge to the noisy target class. This result shows that predictions under the perturbation identical to the training augmentation (red markers) are non-noise-robust distillation targets, whereas predictions under the unseen perturbation (blue markers) are noise-robust distillation targets.
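The trajectories described above can be produced with a simple logging loop. The sketch below is ours, not the paper's code: `model` and `augment` are placeholder callables (a logit-producing network and a stochastic perturbation such as RandomCrop or AutoAugment), and the confidence is read off a softmax over the logits.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gt_confidence(model, augment, image, gt_class: int, n_samples: int = 10, rng=None) -> float:
    """Average softmax confidence assigned to the ground-truth class
    over `n_samples` random perturbations of `image`."""
    rng = np.random.default_rng() if rng is None else rng
    probs = [softmax(model(augment(image, rng)))[gt_class] for _ in range(n_samples)]
    return float(np.mean(probs))
```

Evaluating this quantity at each epoch, once with the training augmentation and once with an unseen one, yields the red and blue curves of Table 6.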
Table 6
Visualization of the model predictions over training. We randomly selected four distinct noisy samples from the training dataset, corresponding to the four rows. The model is trained using RandomCrop and tested on RandomCrop-perturbed inputs (red) and AutoAugment-perturbed inputs (blue). The leftmost column shows the predicted confidence of the perturbed inputs with respect to the ground-truth class. The figures on the right-hand side visualize the softmax vectors projected onto a decagonal surface, analogous to Figure 2a. At the early phase of training, both red and blue markers predict the ground-truth class. However, as training progresses and the model overfits to the noisy labels, the red markers predict the noisy target label, whereas a significant portion of the blue markers predicts the ground-truth class. This shows that unseen perturbations of the input can produce a noise-robust learning signal for training.

[Figure: four rows of per-sample prediction trajectories. Left column: confidence of the ground-truth class over epochs 0–200 for AutoAugment-perturbed (blue) and RandomCrop-perturbed (red) inputs. Right columns: softmax vectors at epochs 40, 70, 100, and 150, projected onto a decagonal surface spanned by the ten CIFAR-10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with the ground-truth (gt) and noisy-label (label) classes marked.]