<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neural Vicinal Risk Minimization: Noise-robust Distillation for Noisy Labels</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hyounguk Shon</string-name>
          <email>hyounguk.shon@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seunghee Koh</string-name>
          <email>seunghee1215@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yunho Jeon</string-name>
          <email>yhjeon@hanbat.ac.kr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junmo Kim</string-name>
          <email>junmo.kim@kaist.ac.kr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Hanbat National University</institution>
          ,
          <addr-line>125, Dongseo-daero, Yuseong-gu, Daejeon, 34158</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Korea Advanced Institute of Science and Technology (KAIST)</institution>
          ,
          <addr-line>291 Daehak-ro, Yuseong-gu, Daejeon, 34141</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Training deep neural networks with noisy supervision remains a challenging problem in weakly supervised learning. Mislabeled instances can severely degrade the generalization ability of classification models to unseen data. In this paper, we propose a novel regularization method coined Noise-robust Distillation (NRD) that addresses robust training under noisy supervision. NRD is motivated by a novel learning framework, which we name Neural Vicinal Risk (NVR) minimization, that improves the estimation quality of the data distribution and handles label noise effectively. Our framework is based upon our observation that a neural network has the capability to correctly classify data sampled from the vicinal distribution even when the model is overfitted to noisy labels. By ensembling the predictions from the neural vicinal distribution, we obtain an accurate estimation of the class probabilities that reflects sample-wise class ambiguity. We validate our method on various noisy label benchmarks and demonstrate significant improvement in robustness to label noise.</p>
      </abstract>
      <kwd-group>
<kwd>Learning with Label Noise</kwd>
        <kwd>Vicinal Risk Minimization</kwd>
        <kwd>Noise-robust Loss</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Deep learning models have achieved remarkable success in
various domains, including image classification, natural
language processing, and speech recognition. However, the
performance of these models heavily relies on the availability of
high-quality labeled data for training. Obtaining accurately
annotated labels can be a challenging and time-consuming
task, often requiring human annotators to manually label
large amounts of data. As a result, noisy labels may arise
during the annotation process, leading to suboptimal model
performance.</p>
      <p>
        In this paper, we address noisy label learning as a subset
of a more generic type of problem. This encompasses
learning from an over-confident target probability distribution
and image ambiguity [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], human annotation errors,
multiple classes in an image, and out-of-distribution training
examples [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] that can naturally occur due to, for example,
random crop data augmentation. We show that our generic
noisy label supervision algorithm can address a combination
of these issues using a simple and unified approach.
      </p>
      <p>We propose a noise-robust learning algorithm named
Noise-Robust Distillation (NRD) to address the issue of noisy
supervision during training. NRD aims to improve the
generalization performance of classification models by explicitly
considering the noise and ambiguity in the training labels.
We motivate NRD by a novel formulation of the noisy
supervision learning problem which we name Neural Vicinal
Risk (NVR) minimization.</p>
<p>This stems from the observation that deep neural
networks have the inherent capability to detect and correct
noisy supervision, even when they are trained under noisy
supervision. This ability is particularly evident when considering
the vicinal distribution, which represents the distribution
generated from perturbed versions of the training data.
Despite being trained on noisy labels, neural networks can still
accurately model the vicinal distribution, indicating their
potential to correct the noisy supervision.</p>
      <p>[Figure 1: Density of the ground-truth (GT) class log-likelihood on NoisyCIFAR-10-symm-50%, comparing not-transformed and transformed inputs (AUROC: 0.9935).]</p>
        <p>Our findings suggest that the combination of
perturbation-based estimation and ensembling can
lead to improved model performance, even in the presence
of noisy supervision. Building on these insights, we propose
Noise-Robust Distillation (NRD), which is a noise-robust
learning method that leverages the neural vicinal risk
principle to enhance the generalization performance of
classification models trained on noisy labels.</p>
<p>The main contributions of this work are as follows:
• We introduce Noise-Robust Distillation (NRD),
a noise-robust learning approach that
comprehensively addresses the challenges posed by noisy
supervision during training.
• NRD is motivated by a novel noise-robust
learning framework which we name Neural Vicinal Risk
(NVR) minimization. We show that NVR improves
the estimation quality of the true class distribution
and handles label noise effectively.
• We demonstrate the ability of neural networks to
detect and correct mislabeled examples through
sensitivity to perturbations in the input data, leading to
improved model predictions and calibration.
• We validate the effectiveness of NRD through
experiments on benchmark datasets, showing clear
improvements in model performance in comparison
to standard training methods under noisy
supervision.</p>
      <p>[Figure: Softmax prediction plots for clean and mislabeled training instances, marking the noisy label and ground-truth (gt) classes among CIFAR-10 categories.]</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        Noisy label learning Numerous methods tackle the
challenge of training Deep Neural Networks (DNNs) on datasets
that contain a mix of correctly labeled and mislabeled
samples, as discussed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Some approaches focus on
designing a noise-robust loss to mitigate the impact of mislabeled
samples. The Mean Absolute Error (MAE) loss [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] demonstrates
competitive performance. Following this, the Generalized
Cross-Entropy (GCE) loss, the Symmetric Cross-Entropy
(SCE) loss, and the active passive loss were proposed with
improved noise-robustness. The Generalized Jensen-Shannon
divergence (GJS) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] enforces consistency between
predictions from multiple augmented views of a sample to
regularize training. Also, the principle of negative learning is
emphasized by [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. The strategies inspired by the
training dynamics of models [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] such as early stopping [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]
or over-parameterization [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] exploit the different
convergence speeds of clean and noisy samples. Co-teaching [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
involves simultaneous training of two DNNs, where each
network learns from the clean samples chosen by its
counterpart. Noise identification aims to filter noisy samples
from the training dataset. Noisy samples can be filtered by
measuring the degree of disagreement between ensemble
models, which occurs once the model is overfitted to the
noisy samples. Recent algorithms [
        <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
        ] utilize the
power of Semi-Supervised Learning (SSL) by following a
two-step process: filtering out noisy labels first, and then
treating the detected noisy samples as unlabeled for
reducing the noisy label learning problem into an SSL task.
      </p>
      <p>
        Semi-supervised learning (SSL) has emerged as a
powerful method for noisy label learning. Among them,
consistency regularization encourages a model to produce consistent
outputs across data augmentations, as in the Π-model,
Temporal Ensembling [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and Mean Teacher [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Also, FixMatch
[
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] integrates pseudo-labeling, and virtual adversarial
training [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] utilizes adversarial attacks. MixMatch [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ],
adopted by DivideMix [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], generates pseudo-labels with
sharpening for data-augmented unlabeled examples and
mixes labeled and unlabeled data using MixUp [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        Calibration and knowledge distillation Confidence
calibration [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] is the process of adjusting a model’s
predicted probabilities to better reflect the true likelihood. It is
demonstrated that training a model with data augmentation
like Mixup [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] improves model calibration and robustness
to noise [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Meanwhile, Knowledge Distillation (KD) [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]
enhances the student model by transferring knowledge
contained in the prediction of the teacher model, focusing on
"dark" or "hidden" knowledge, including its confident and
less confident predictions.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminaries</title>
      <sec id="sec-3-1">
        <title>3.1. Notations</title>
<p>Consider a DNN classification model parameterized by
$\theta \in \Theta$ as $f(x, \theta) : \mathcal{X} \mapsto \Delta^{K-1}$, which outputs a
probability distribution $P(y|x; \theta)$. The input space is defined as
$\mathcal{X} = \mathbb{R}^{H \times W \times C}$, where $H, W, C$ are the height,
width, and number of color channels of the image data. $\Delta^{K}$ denotes the
$K$-simplex. The model takes an image input $x \in \mathcal{X}$ and
predicts a categorical distribution over $\mathcal{Y} = \{1, 2, \ldots, K\}$. We
denote an image augmentation operation as $T(x) : \mathcal{X} \to
\mathcal{X}$, and the training dataset as $D = \{(x_i, y_i)\}_{i=1}^{N}$. The loss
function is defined as $\ell(x, y, \theta) : \mathcal{X} \times \mathcal{Y} \times \Theta \mapsto \mathbb{R}$. $\delta(\cdot)$ is
the Dirac delta function and $\mathbb{1}\{\cdot\}$ is the indicator function.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Empirical Risk</title>
<p>The expected risk $R(\theta)$ is defined as the average loss over
$P(x, y)$,
$$R(\theta) = \int_{x,y} \ell(x, y, \theta)\, P(x, y) \, dx \, dy . \quad (1)$$
In practice, a dataset $D$ is used to mimic the true distribution
$P(x, y)$, which leads to the empirical risk
$$\hat{R}(\theta) = \int_{x,y} \ell(x, y, \theta)\, \hat{P}(x, y) \, dx \, dy , \quad (2)$$
where the corresponding empirical distribution $\hat{P}(x, y)$ is
a mixture of delta masses using the observed samples, and
the class distribution is a one-hot distribution given by the
annotations,
$$\hat{P}(x, y) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\{y = y_i\}\, \delta(x - x_i) . \quad (3)$$</p>
        <p>Our goal is to refine the estimation of the data
distribution $P(x, y)$ by utilizing the empirical distribution $\hat{P}(x, y)$.</p>
<p>
          A pivotal question that arises is how to enhance the
approximation of the true risk $R(\theta)$ intrinsic to a classification
model. As evidenced by Equation (3), this task
necessitates the accurate estimation of two orthogonal components
present within the true distribution $P(x, y) = P(y|x)P(x)$:
(1) the input distribution $P(x)$ and (2) the corresponding
conditional distribution $P(y|x)$.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Neural Empirical Risk</title>
        <p>
          Estimating $P(y|x)$ as a one-hot distribution involves
assigning a single class label per sample, which is vulnerable
to human annotation errors. Unfortunately, it proves
challenging to enhance or secure accurate supervision signals
for $P(y|x)$, as this requires multiple human annotators
reviewing the same image [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which is a prohibitively costly
process. Nonetheless, enhancing the estimation quality of
the true class distribution $P(y|x)$ can lead to further
improvements in estimating and minimizing the true risk.
        </p>
<p>Neural Empirical Risk (NER) Instead of using
Equation (3), we can choose to parameterize $P(y|x)$ by a neural
network $P(y|x, \theta)$ to further improve the estimation
quality. First, we factorize the data distribution as $P(x, y) =
P(y|x)P(x)$, and denote the corresponding empirical
distributions as follows:
$$\hat{P}(x) = \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i) \quad (4)$$
$$\hat{P}(y|x_i) = \mathbb{1}\{y = y_i\} . \quad (5)$$
Instead of using $\hat{P}(y|x)$, we choose to use a distribution
parameterized by a neural network trained on $D$,
$$P(y|x, D) = \int_{\Theta} P(y|x, \theta)\, P(\theta|D) \, d\theta , \quad (6)$$
where $P(\theta|D)$ is the distribution over the function class
parameterized by the neural network. By plugging Equation (6)
into $\hat{P}(x, y) = \hat{P}(y|x)\hat{P}(x)$, we define the neural empirical
distribution $\hat{P}_N$ and the neural empirical risk $\hat{R}_N$ as
$$\hat{P}_N(x, y|\theta) = P(y|x, \theta)\, \hat{P}(x) \quad (7)$$
$$\hat{R}_N(\theta) = \int_{x,y} \ell(x, y, \theta)\, \hat{P}_N(x, y|\theta) \, dx \, dy . \quad (8)$$
Here, we refer to the model $P(y|x, \theta)$ as the teacher network
to distinguish it from the model being trained; the term
is borrowed from knowledge distillation. This can provide
better estimation quality than $\hat{P}(y|x)$, as is often observed
in knowledge distillation, which we view as an instance
of NER minimization. Knowledge distillation is known to
improve generalization and calibration performance due to
the dark knowledge [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
        </p>
<p>
          However, when the model is over-fitted to the noisy labels,
it severely degrades the performance of estimating the class
probabilities. Hence, in order to effectively utilize a neural
network, it is necessary to employ a noise-robust method to
accurately estimate the class probabilities in the presence
of noisy labels.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Vicinal risk for noise-robust learning</title>
        <p>
          Our motivation is based on the Vicinal Risk Minimization
(VRM) principle [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ], which is an alternative approximation
to $P(x, y)$. The vicinal distribution $P_v(\tilde{x}, \tilde{y})$ constructed
from the data distribution is defined as
$$P_v(\tilde{x}, \tilde{y}) = \int_{x,y} P_v(\tilde{x}, \tilde{y}|x, y)\, P(x, y) \, dx \, dy , \quad (9)$$
where $P_v(\tilde{x}, \tilde{y}|x, y)$ is the vicinity distribution around $(x, y)$.
For example, [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] used additive Gaussian noise $\mathcal{N}(0, \sigma^2)$.
        </p>
<p>
          MixUp [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] and CutMix [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] chose stochastic interpolation
between samples, which has also shown effectiveness under
label noise [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. Using the dataset, Equation (9) is replaced
by the empirical distribution as
$$\hat{P}_v(\tilde{x}, \tilde{y}) = \int_{x,y} P_v(\tilde{x}, \tilde{y}|x, y)\, \hat{P}(x, y) \, dx \, dy \quad (10)$$
$$= \frac{1}{N} \sum_{i=1}^{N} P_v(\tilde{x}, \tilde{y}|x_i, y_i) . \quad (11)$$
        </p>
<p>Neural Vicinal Risk (NVR) We propose to further
improve this by using a neural network to robustly approximate
the data distribution by modifying Equation (9). We propose
the following approximate vicinal data distribution,
parameterized by a deep neural network $\theta$, which we name the neural
vicinal distribution $P_{nv}$:
$$P_{nv}(\tilde{x}, \tilde{y}|D) = P(\tilde{y}|\tilde{x}; D)\, P_v(\tilde{x}) \quad (12)$$
$$= \int_{\Theta} \int_{x} P(\tilde{y}|\tilde{x}, \theta)\, P(\theta|D)\, P_v(\tilde{x}|x)\, P(x) \, dx \, d\theta \quad (13)$$
$$\approx \int_{\Theta} \int_{x} P(\tilde{y}|\tilde{x}, \theta)\, \delta(\theta - \theta^*)\, P_v(\tilde{x}|x)\, \hat{P}(x) \, dx \, d\theta \quad (14)$$
$$= P(\tilde{y}|\tilde{x}, \theta^*)\, \frac{1}{N} \sum_{i=1}^{N} P_v(\tilde{x}|x_i) \quad (15)$$
$$= \frac{1}{N} \sum_{i=1}^{N} P(\tilde{y}|\tilde{x}, \theta^*)\, P_v(\tilde{x}|x_i) . \quad (16)$$
Here, $\theta^* = \arg\min_{\theta} \hat{R}(\theta)$ is the maximum-a-posteriori
(MAP) model trained on $D$. It is important to note that
the samples from the vicinal distribution $P_v(\tilde{x}|x)$ are not
shown at the training of the model $\theta^*$. Equation (14) is
given by substituting the Bayesian model with the MAP
model and also replacing the true distribution $P(x)$ with the
empirical distribution. The true neural vicinal distribution
is approximated by the ensembled MAP model predictions
averaged over the samples from the vicinal distribution.</p>
        <p>Therefore, we define the empirical neural vicinal
distribution $\hat{P}_{nv}$ as
$$\hat{P}_{nv}(\tilde{x}, \tilde{y}; \theta^*) = \frac{1}{N} \sum_{i=1}^{N} P(\tilde{y}|\tilde{x}, \theta^*)\, P_v(\tilde{x}|x_i) . \quad (17)$$</p>
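        <p>As a concrete illustration of Equation (17), the neural vicinal prediction for an input can be estimated by Monte-Carlo ensembling over sampled novel views. The following is a minimal PyTorch sketch, assuming a trained model and a teacher_aug sampler standing in for the vicinity distribution; the function name and the number of samples are illustrative, not the authors' implementation.</p>
        <p>import torch

@torch.no_grad()
def vicinal_prediction(model, x, teacher_aug, n_samples=8):
    # Approximate E_{P_v(x_tilde|x)}[ P(y | x_tilde, theta*) ] for a batch x
    # by averaging softmax outputs over views drawn from the vicinity distribution.
    probs = [model(teacher_aug(x)).softmax(dim=-1) for _ in range(n_samples)]
    return torch.stack(probs).mean(dim=0)    # ensembled class probabilities</p>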
        <p>[Figure: NRD training pipeline (student augmentation, teacher augmentation, EMA update, stop-grad). The model prediction from the seen views (student augmentation) is regularized using the predictions generated from unseen views (teacher augmentation). We use an asymmetric augmentation policy so that the teacher augmentation generates novel views, and the stop-gradient operation ensures that the model does not memorize the views generated from the teacher augmentation.]</p>
        <p>Note that $P_v(\tilde{x}|x)$ is distinct from the augmentation strategy
applied to the model being trained. Similar to Equation (8),
we refer to $P_v(\tilde{x}|x)$ as the teacher augmentation.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Self-correction for memorized instances</title>
        <p>We further discuss the behavior of the neural vicinal
distribution over a noisy training dataset. Notably, when a
training dataset includes mislabeled instances, a teacher
neural network can overfit to these noisy labels, in which case
neural empirical risk minimization fails to mitigate the
impact. Interestingly, we observe that the neural vicinal
distribution exhibits robustness against label noise, effectively
self-correcting incoherent labels within the training set.</p>
        <p>
          To understand this phenomenon, we visualize the
behavior of the neural vicinal distribution in Figure 2a. Here,
we compare the softmax scores from the augmented input
samples, distinguishing between the neural empirical
distribution (red marker) and the neural vicinal distribution
(blue markers). For the transformation policy, the network
was trained using random crop augmentation, and
AutoAugment [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] was chosen as the vicinal distribution to generate
the novel views. The top row shows clean instances and the
bottom row shows mislabeled instances.
        </p>
        <p>The visual analysis contrasts the softmax predictions from
both seen and novel views of clean and mislabeled training
instances. The self-correction of the neural vicinal
distribution is instance-dependent which responds diferently based
on if an instance is clean or mislabeled. Notably, while the
teacher network’s predictions for the novel views tend to
shift misclassified predictions towards ground truth, they
remain consistent for clean samples. This suggests that the
network outputs corrected predictions by dissociating the
novel views from the memorized views.</p>
        <p>Next, in Figure 1, we analyzed the label correction
behavior of the neural vicinal distribution over the dataset
population. Note that the models are trained only using
the noisy training set, without access to the ground-truth
labels. Applying transformation (blue curve) significantly
reduced the GT class cross-entropy loss compared to no
transformation (red curve), and we observed a good
separation between the two distributions. Also, Table 1 shows
the ground-truth accuracy for the training samples where
we observed significant improvements for the mislabeled
instances when transformation is applied.</p>
        <p>We additionally observed that ensembling perturbed
predictions enhances the calibration, as depicted in Figure 2b.</p>
<p>While the original model is heavily over-confident due to
overfitting (red), the vicinal prediction improves accuracy and
reflects class ambiguities (blue).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <p>Motivated by the observation in Section 3.5, we propose a
novel learning method for noisy labels named Noise-robust
Distillation (NRD). Our method is formulated as a simple
loss function, which makes it easy to employ in existing
training pipelines.</p>
      <p>For this, we combine the target loss with the neural
vicinal risk loss as a regularization objective. We formulate
the combined objectives into a triplet loss. We have found
the Jensen-Shannon divergence (JSD) to be effective, as it
generalizes to a triplet loss. The JSD for three distributions
is
$$\mathrm{JSD}_{\pi}(\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3) = \sum_{i} \pi_i \, \mathrm{KL}(\mathbf{p}_i \,\|\, \mathbf{m}) , \quad (19)$$
where $\mathbf{m} = \sum_{i} \pi_i \mathbf{p}_i$. The hyperparameter $\pi \in \Delta^2$ is
chosen to balance the importance weight between the
distributions. Additionally, the JS divergence is known to have a
nice robustness property against label noise: [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] showed
that the JS divergence simulates the MAE loss [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] in its asymptote.
      </p>
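      <p>To make Equation (19) concrete, the following is a minimal PyTorch sketch of the three-way weighted JSD used as the triplet loss. The function name js_div matches Algorithm 1 below; the default weights follow the WebVision setting reported in Appendix A.2, and the epsilon constant is an assumed numerical-stability term.</p>
      <p>import torch
import torch.nn.functional as F

def js_div(p1, p2, p3, pi=(0.1, 0.8, 0.1), eps=1e-8):
    # JSD_pi(p1, p2, p3) = sum_i pi_i * KL(p_i || m), with m = sum_i pi_i * p_i.
    # Each input is a batch of categorical distributions (rows sum to one).
    ps = (p1, p2, p3)
    m = sum(w * p for w, p in zip(pi, ps))    # mixture distribution m
    log_m = torch.log(m + eps)
    # F.kl_div(log_m, p) computes KL(p || m) when log_m holds log-probabilities
    return sum(w * F.kl_div(log_m, p, reduction="batchmean") for w, p in zip(pi, ps))</p>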
      <p>Next, we derive our NRD objective step-by-step. By
applying NVR to the JSD loss, we have
$$\mathcal{L}(\theta; x, \mathbf{y}, \phi) = \mathrm{JSD}_{\pi}(\mathbf{y}, \mathbf{y}_s, \mathbf{y}_t) \quad (20)$$
$$\mathbf{y}_s = f(x, \theta) \quad (21)$$
$$\mathbf{y}_t = \mathbb{E}_{P_v(\tilde{x}|x)}\left[ f(\tilde{x}, \phi) \right] , \quad (22)$$
assuming that we have a trained teacher network $\phi$. Here,
$\mathbf{y}$ is the target label, $\mathbf{y}_s$ is the model output, and $\mathbf{y}_t$
is the teacher network output. The loss is solved for
$\min_{\theta} \mathcal{L}(\mathbf{y}, \mathbf{y}_s, \mathbf{y}_t)$.</p>
      <p>To improve noise-robustness, we can further employ an
iterative distillation scheme in which we repeat the strategy
for multiple rounds of training. We set the teacher network
as the model obtained from the previous training round,
such that $\phi = \theta_{k-1}$ at the $k$-th training round. Applying this to
Equation (20),
$$\theta_k = \arg\min_{\theta} \mathcal{L}(\theta; x, \mathbf{y}, \theta_{k-1}) . \quad (23)$$
A student network obtained from the previous training round
is switched to the teacher role for the next round. However,
in practice, we found this to be unstable and difficult to
converge. Instead, we take the exponential moving average of the
historical models as the teacher and set $\phi = \bar{\theta}_{k-1}$:
$$\bar{\theta}_k = \beta \cdot \bar{\theta}_{k-1} + (1 - \beta) \cdot \theta_k . \quad (24)$$</p>
      <p>For the decay rate, we simply set $\beta = 0.99$ for all
experiments. The aggregation reduces the variance of the neural
vicinal risk estimation caused by stochastic gradients, and
we have empirically found that it effectively stabilizes
training and leads to faster convergence.</p>
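      <p>The moving-average teacher of Equation (24) can be maintained as a frozen copy of the student that is updated in place after every SGD step. The following is a minimal sketch of the ema wrapper assumed in Algorithm 1; the class name and attributes are illustrative, with beta = 0.99 as stated above.</p>
      <p>import copy
import torch

class ema:
    def __init__(self, model, beta=0.99):
        self.teacher = copy.deepcopy(model)    # theta_bar, the EMA weights
        self.teacher.requires_grad_(False)     # teacher receives no gradients
        self.student = model                   # theta, the model being trained
        self.beta = beta

    @torch.no_grad()
    def update(self):
        # theta_bar_k = beta * theta_bar_{k-1} + (1 - beta) * theta_k
        for p_bar, p in zip(self.teacher.parameters(), self.student.parameters()):
            p_bar.mul_(self.beta).add_(p, alpha=1.0 - self.beta)

    def __call__(self, x):
        return self.teacher(x)</p>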
      <p>Finally, we formally define our NRD training objective.
To reduce the training cost, we simplify each training round
into a single step of stochastic gradient descent (SGD).
This simplifies the algorithm from a multi-staged process into
a single-staged process, and significantly accelerates the
training. The NRD objective is
$$\mathcal{L}_{\mathrm{NRD}}(\theta; x, \mathbf{y}, \bar{\theta}) = \mathrm{JSD}_{\pi}(\mathbf{y}, \mathbf{y}_s, \mathbf{y}_t) \quad (25)$$
$$\mathbf{y}_s = f(x, \theta) \quad (26)$$
$$\mathbf{y}_t = \mathbb{E}_{P_v(\tilde{x}|x)}\left[ f(\tilde{x}, \bar{\theta}) \right] , \quad (27)$$
with a slight abuse of notation for $\bar{\theta}$, which is not an
optimization variable but is continuously updated after each
SGD step. This is implemented by detaching $\mathbf{y}_t$ from the
backpropagation graph, which prevents the model from
memorizing the teacher augmentation views (stop-grad in
the figure); we found that a single EMA update per SGD
step was sufficient. The overall architecture is illustrated in
the figure above.</p>
      <p>Algorithm 1 PyTorch-style pseudocode
ema_model = ema(model)                   # EMA copy of the student (teacher)
optimizer = sgd_optimizer(model)
for x, y in dataloader:
    x_t = teacher_aug(x)                 # unseen (teacher) views
    x_s = student_aug(x)                 # seen (student) views
    y_t = ema_model(x_t).detach()        # disconnect from backprop (stop-grad)
    y_s = model(x_s)
    loss = js_div(y, y_s, y_t)           # JSD between target, student, teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_model.update()                   # EMA update of the teacher
</p>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental settings</title>
        <p>
          Benchmarking datasets For synthetic label noise
benchmarks, we used NoisyCIFAR-10 and NoisyCIFAR-100 [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. For
symmetric label noise, we randomly flip the ground-truth
label with probability $\eta$ uniformly across all categories.
For asymmetric label noise, we follow the scheme in [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].
For NoisyCIFAR-10-asymm, we flip truck→automobile,
bird→airplane, cat→dog, dog→cat, and deer→horse. For
NoisyCIFAR-100-asymm, within each superclass, we
randomly replace a subclass label $y$ with the adjacent subclass $y + 1$
with probability $\eta$.
      </p>
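        <p>For concreteness, the following is a minimal NumPy sketch of the two corruption schemes described above; the function names, RNG seeding, and explicit CIFAR-10 class-index mapping are illustrative reconstructions, not the benchmark's reference implementation.</p>
        <p>import numpy as np

def symmetric_noise(labels, eta, num_classes, seed=0):
    # Flip each label with probability eta, uniformly across all categories.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    flip = eta > rng.random(len(labels))
    labels[flip] = rng.integers(num_classes, size=flip.sum())
    return labels

# truck->automobile, bird->airplane, cat->dog, dog->cat, deer->horse
CIFAR10_ASYM_MAP = {9: 1, 2: 0, 3: 5, 5: 3, 4: 7}

def asymmetric_noise(labels, eta, mapping=CIFAR10_ASYM_MAP, seed=0):
    # Flip only the mapped classes, each with probability eta.
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    for i, y in enumerate(labels):
        if y in mapping and eta > rng.random():
            labels[i] = mapping[y]
    return labels</p>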
      <p>
        For the real-world benchmark, we used WebVision [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]
dataset. WebVision consists of 2.4M training examples
collected via Google and Flickr image search. We used a
miniaturized training set following [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] which uses only the first
50 categories in the “Google” image set. Mini-WebVision
consists of 66K training and 2.5K validation examples. We
additionally evaluated the trained model on ImageNet [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]
validation set. The noise rate is known to be around 20%.
      </p>
      <p>
        Baseline methods For the CIFAR benchmarks, we
compare against cross-entropy (CE), bootstrapping (BS) [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ],
label smoothing (LS) [
        <xref ref-type="bibr" rid="ref35">35</xref>
        ], symmetric cross-entropy (SCE)
[
        <xref ref-type="bibr" rid="ref36">36</xref>
        ], generalized cross-entropy (GCE) [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ], normalized loss
(NCE+RCE) [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ], Jensen-Shannon divergence (JS, GJS) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        For the WebVision benchmarks, we compared our method
with the state-of-the-art methods including ELR+ [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
DivideMix [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and GJS [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The baseline results were adopted
from [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Models PreActResNet-34 architecture [
        <xref ref-type="bibr" rid="ref39">39</xref>
        ] is used for all
experiments conducted on CIFAR-10/100 datasets. For
WebVision experiments, we used ResNet-50. All experiments
were trained from random initialization.
      </p>
      <p>
        Augmentation policy For the CIFAR experiments, we
followed [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and used RandAugment [
        <xref ref-type="bibr" rid="ref40">40</xref>
        ] chained with Cutout
[
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] for all methods. For the NRD teacher transformation,
we used AugMix [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ] in all experiments.
      </p>
      <p>Hyperparameters For the CIFAR-10/100 benchmarks, we trained for
400 epochs in each run. We used the SGD optimizer with
momentum 0.9 and a weight decay of $10^{-4}$. Learning rates
and weight decay rates follow the configurations detailed in
Appendix A.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>Performance on noisy label benchmarks In Table 2,
we show the performance of our method in comparison to
robust loss functions. While most of the baselines show
inconsistent performance between symmetric and
asymmetric noise types, our method shows consistent improvement
across a wide range of noise rates and noise types.
Notably, we significantly improve performance under high
noise rate settings where GJS tends to underperform. For
NoisyCIFAR-10 80% noise, we improve by 5%p over SCE,
and for NoisyCIFAR-100-80%, we improve by 10%p over
GCE.</p>
        <p>Furthermore, the results on the large-scale real-world noisy
label benchmark are shown in Table 3. Notably, we observed
that our method outperforms existing methods that
use two networks.</p>
        <p>Performance on clean datasets The proposed method
improves model generalization when applied to clean dataset
training as seen in Table 4. This is because the training
dataset contains visually ambiguous images that make it
difficult to draw a clear decision boundary, and therefore
the hard target distributions from the annotations serve as
a type of noisy supervision signal. We show that applying
NRD can regularize and improve the performance of the
model.</p>
        <p>Comparison to consistency regularization Consistency
regularization used in GJS is a powerful technique for
noise-robustness. While it is similar to NRD, it does
not directly prevent memorization of noisy labels. Figure 4
shows that GJS suffers from overfitting when trained for an
extended number of steps. This is shown by test accuracy
decreasing after reaching a peak at an early epoch. In
contrast, NRD significantly mitigates overfitting. Notably, in the
80% noise rate setting, we improve over GJS by 36%p. The key
contributing factor is that our method uses stop-gradient,
which directly prevents the model from memorizing the
views generated by the asymmetric augmentation policy.</p>
        <p>Confidence calibration In Figure 5, we additionally
evaluated the calibration performance. We observed that the
regularization effect from NRD also improves the calibration
of the model. Our method shows consistent calibration
performance across all noise rates, which aligns with the
performance of our method.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>Our work proposes Noise-Robust Distillation (NRD) which
is a simple regularization objective that is designed to
improve a wide range of noisy supervision problems in training.
We motivate our method based on the novel formulation
of Neural Vicinal Risk (NVR) minimization, which focuses
on leveraging deep neural networks to improve empirical
risk minimization under noisy supervision scenarios. A key
insight of our work is the inherent capacity of deep neural
networks to detect and correct mislabeled examples based
on vicinal distribution, a feature we exploited to improve
model predictions and calibration. We have validated our
method on several noisy label learning benchmarks. The
results show clear improvements in performance compared
to the baselines under noisy supervision. These findings
suggest that NRD offers an effective strategy for handling
noisy supervision, leading to enhanced generalization
performance of classification models.</p>
      <p>Acknowledgement This work was supported by the
National Research Foundation of Korea (NRF) grant funded
by the Korea government (MSIT) (No. RS-2023-00240379).</p>
    </sec>
    <sec id="sec-7">
      <title>A. Detailed hyperparameter configurations</title>
      <sec id="sec-7-1">
        <title>A.1. CIFAR-10/100 benchmarks</title>
        <p>
General training details For the network architecture,
we use PreActResNet-34 [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ]. For training, we use SGD
optimizer with momentum 0.9, a batch size of 128, and train
for 400 epochs. The learning rate is reduced by 1/10 at 50%
and 75% of the training iterations.
      </p>
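        <p>The schedule above corresponds to a standard step-decay setup. A minimal PyTorch sketch follows; the stand-in model and the initial learning rate of 0.1 are assumptions (the text defers exact learning rates to Table 5), and the milestones realize the 1/10 decay at 50% and 75% of 400 epochs.</p>
        <p>import torch

model = torch.nn.Linear(10, 10)    # stand-in for PreActResNet-34
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,    # lr assumed, see Table 5
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[200, 300], gamma=0.1)    # 50% and 75% of 400 epochs</p>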
      <p>
        Augmentation policy For data augmentation, we use
RandAugment [
        <xref ref-type="bibr" rid="ref40">40</xref>
          ] with $N = 1$, $M = 3$, followed by random
crop (size 32 with 4-pixel padding), random horizontal flip,
and Cutout [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] with length 5.
      </p>
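        <p>A minimal torchvision sketch of this student-augmentation pipeline is shown below; RandomErasing is used as a stand-in for Cutout with length 5 (the area fraction is an assumption), and the RandAugment arguments follow the $N = 1$, $M = 3$ setting above.</p>
        <p>from torchvision import transforms

student_aug = transforms.Compose([
    transforms.RandAugment(num_ops=1, magnitude=3),    # RandAugment, N=1, M=3
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.RandomErasing(scale=(0.02, 0.03)),      # stand-in for Cutout (length 5)
])</p>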
      <p>
          Hyperparameters See Table 5 for the details. For the
baselines, we follow the same hyperparameter configurations
used by [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The 40% noise rate setting was used to find the best
learning rates and weight decay rates. For the learning rates
and weight decay rates for NRD, we used the same
configurations as GJS. For tuning the hyperparameters $\{\pi_1, \pi_2, \pi_3\}$
in the NRD loss, we fixed $\pi_1 = \pi_3$ so that the targets $\mathbf{y}$ and
$\mathbf{y}_t$ have equal weight. We tuned $\pi_2 \in \{0.1, 0.2, \ldots, 0.9\}$.
For the moving average decay rate, we used $\beta = 0.99$ for
all experiments.
      </p>
      </sec>
      <sec id="sec-7-2">
        <title>A.2. WebVision benchmark</title>
        <p>
General training details For the network architecture, we
use ResNet-50 with random initialization. For training, we
use SGD optimizer with momentum 0.9, a batch size of 64,
and train for 300 epochs. The initial learning rate was set to
0.1 and reduced by 1/10 after the 100-th and 200-th epoch.
Augmentation policy For data augmentation, we use
random resized crop with size 224, random horizontal flip, and
color jitter. We used the color jitter implementation from
TorchVision [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ] with brightness=0.4, contrast=0.4,
saturation=0.4, hue=0.2. For the NRD teacher augmentation, we
use AugMix [
        <xref ref-type="bibr" rid="ref42">42</xref>
          ] followed by random resized crop with size
224 and random horizontal flip.
      </p>
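        <p>For reference, a minimal torchvision sketch of the WebVision student-augmentation pipeline described above (the variable name is illustrative):</p>
        <p>from torchvision import transforms

webvision_aug = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.2),
    transforms.ToTensor(),
])</p>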
        <p>Hyperparameters For the hyperparameters $\{\pi_1, \pi_2, \pi_3\}$
in the NRD loss, we used $\pi_1 = \pi_3 = 0.1$ and $\pi_2 = 0.8$.
The moving average decay rate was set to $\beta = 0.99$.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>B. Training dynamics visualization of perturbed inputs</title>
      <p>
In this section, we provide the visualized trajectory of the
model prediction of the perturbed inputs throughout
training. (See Table 6) These are the same plots presented in
Figure 2a, albeit on diferent mid-training epochs. The model
is trained using a standard training scheme with the
crossentropy loss on the NoisyCIFAR-10-symm-40% dataset. We
observe that a significant portion of the predictions
perturbed using augmentation unseen at training
(AutoAugment) gradually settles to the ground truth class, whereas
the predictions perturbed using the same augmentation
policy used at training (RandomCrop) eventually converge to
the noisy target class. The result shows that predictions
from the perturbation identical to the training
augmentation (red markers) are non-noise-robust distillation targets,
whereas the predictions from the unseen perturbation (blue
markers) are noise-robust distillation targets.</p>
      <p>[Figure: Softmax prediction trajectories of perturbed inputs at mid-training epochs (e.g., 150 and 200), plotting GT-class confidence (0.0 to 1.0) for CIFAR-10 categories such as airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck, with the noisy label and ground-truth (gt) classes marked.]</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmarje</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Grossmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zelenka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dippel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Oszust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pastell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Valros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Volkmann</surname>
          </string-name>
          , et al.,
          <article-title>Is one annotation enough?-a datacentric image classification benchmark for noisy and ambiguous label estimation</article-title>
          ,
          <source>in: Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Heo</surname>
          </string-name>
          , D. Han,
          <string-name>
            <surname>J</surname>
          </string-name>
          . Choe,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chun</surname>
          </string-name>
          ,
          <article-title>Relabeling imagenet: from single to multi-labels, from global to localized labels</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2340</fpage>
          -
          <lpage>2350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Lehman</surname>
          </string-name>
          , Visualizing softmax,
          <year>2019</year>
          . URL: https://charlielehman.github.io/post/ visualizing-tempscaling/.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-G.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Learning from noisy labels with deep neural networks: A survey</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <article-title>Robust loss functions under label noise for deep neural networks</article-title>
          ,
          <source>in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Englesson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azizpour</surname>
          </string-name>
          ,
          <article-title>Generalized jensenshannon divergence loss for learning with noisy labels</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>30284</fpage>
          -
          <lpage>30297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          , Nlnl:
          <article-title>Negative learning for noisy labels</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          (ICCV),
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>Joint negative and positive learning for noisy labels</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>9442</fpage>
          -
          <lpage>9451</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Arpit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jastrzębski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ballas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Krueger</surname>
          </string-name>
          , E. Bengio,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Kanwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Maharaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lacoste-Julien</surname>
          </string-name>
          ,
          <article-title>A closer look at memorization in deep networks</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Machine Learning</source>
          , volume
          <volume>70</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>233</fpage>
          -
          <lpage>242</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Niles-Weed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Razavian</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>FernandezGranda, Early-learning regularization prevents memorization of noisy labels</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>20331</fpage>
          -
          <lpage>20342</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y</surname>
          </string-name>
          . Yang,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          , G. Niu, T. Liu,
          <article-title>Understanding and improving early stopping for learning with noisy labels</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>34</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>24392</fpage>
          -
          <lpage>24403</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <article-title>Robust training under label noise by over-parameterization</article-title>
          ,
          <source>in: Proceedings of the 39th International Conference on Machine Learning</source>
          , volume
          <volume>162</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>14153</fpage>
          -
          <lpage>14172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Tsang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          , Co-teaching:
          <article-title>Robust training of deep neural networks with extremely noisy labels</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>31</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>Dividemix: Learning with noisy labels as semi-supervised learning</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Neighborhood collective estimation for noisy label identification and correction</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Centrality and consistency: Two-stage clean samples identification for learning with instance-dependent noisy labels</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , volume
          <volume>13685</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>21</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Laine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Aila</surname>
          </string-name>
          ,
          <article-title>Temporal ensembling for semisupervised learning</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2017</year>
          . URL: https:// openreview.net/forum?id=
          <fpage>BJ6oOfqge</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Tarvainen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Valpola</surname>
          </string-name>
          ,
          <article-title>Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Berthelot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Rafel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kurakin</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.-L. Li</surname>
          </string-name>
          ,
          <article-title>Fixmatch: Simplifying semi-supervised learning with consistency and confidence</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>33</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>596</fpage>
          -
          <lpage>608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Miyato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <article-title>Adversarial training methods for semi-supervised text classification</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>D.</given-names>
            <surname>Berthelot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Papernot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>Mixmatch: A holistic approach to semi-supervised learning</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>32</volume>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cisse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. N.</given-names>
            <surname>Dauphin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lopez-Paz</surname>
          </string-name>
          ,
          <article-title>mixup: Beyond empirical risk minimization</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>C.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pleiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <article-title>On calibration of modern neural networks</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1321</fpage>
          -
          <lpage>1330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thulasidasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chennupati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Bilmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Michalak</surname>
          </string-name>
          ,
          <article-title>On mixup training: Improved calibration and predictive uncertainty for deep neural networks</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>32</volume>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distilling the knowledge in a neural network</article-title>
          ,
          <year>2015</year>
          . arXiv:1503.02531.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>O.</given-names>
            <surname>Chapelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>Vicinal risk minimization</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>13</volume>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>S.</given-names>
            <surname>Yun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yoo</surname>
          </string-name>
          ,
          <article-title>Cutmix: Regularization strategy to train strong classifiers with localizable features</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6023</fpage>
          -
          <lpage>6032</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Autoaugment: Learning augmentation strategies from data</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <article-title>Learning multiple layers of features from tiny images</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>G.</given-names>
            <surname>Patrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <article-title>Making neural networks robust to label noise: a loss correction approach</article-title>
          ,
          <source>stat</source>
          <volume>1050</volume>
          (
          <year>2016</year>
          )
          <fpage>13</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Agustsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Van Gool</surname>
          </string-name>
          ,
          <article-title>Webvision database: Visual learning and understanding from web data</article-title>
          ,
          <source>arXiv preprint arXiv:1708.02862</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>P.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. B.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Understanding and utilizing deep neural networks trained with noisy labels</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1062</fpage>
          -
          <lpage>1070</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          ,
          <source>in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          ,
          <article-title>Training deep neural networks on noisy labels with bootstrapping</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6596</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lukasik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhojanapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Menon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Does label smoothing mitigate label noise?</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6448</fpage>
          -
          <lpage>6458</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bailey</surname>
          </string-name>
          ,
          <article-title>Symmetric cross entropy for robust learning with noisy labels</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sabuncu</surname>
          </string-name>
          ,
          <article-title>Generalized cross entropy loss for training deep neural networks with noisy labels</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>31</volume>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Romano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Erfani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bailey</surname>
          </string-name>
          ,
          <article-title>Normalized loss functions for deep learning with noisy labels</article-title>
          ,
          <source>in: Proceedings of the 37th International Conference on Machine Learning</source>
          , volume
          <volume>119</volume>
          <source>of Proceedings of Machine Learning Research, PMLR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6543</fpage>
          -
          <lpage>6553</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: CVPR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shlens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Randaugment: Practical automated data augmentation with a reduced search space</article-title>
          ,
          <source>in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>3008</fpage>
          -
          <lpage>3017</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>T.</given-names>
            <surname>DeVries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,
          <article-title>Improved regularization of convolutional neural networks with cutout</article-title>
          ,
          <source>arXiv preprint arXiv:1708.04552</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. D.</given-names>
            <surname>Cubuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gilmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lakshminarayanan</surname>
          </string-name>
          ,
          <article-title>AugMix: A simple data processing method to improve robustness and uncertainty</article-title>
          ,
          <source>Proceedings of the International Conference on Learning Representations (ICLR)</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Identity mappings in deep residual networks</article-title>
          ,
          <source>in: Computer Vision - ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part IV, Springer</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>630</fpage>
          -
          <lpage>645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>TorchVision maintainers and contributors</string-name>
          ,
          <article-title>Torchvision: PyTorch's computer vision library</article-title>
          , https://github.com/pytorch/vision,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>