<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Neural Vicinal Risk Minimization: Noise-robust Distillation for Noisy Labels</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hyounguk</forename><surname>Shon</surname></persName>
							<email>hyounguk.shon@kaist.ac.kr</email>
							<affiliation key="aff0">
								<orgName type="institution">Korea Advanced Institute of Science and Technology (KAIST)</orgName>
								<address>
									<addrLine>291 Daehak-ro, Yuseong-gu</addrLine>
									<postCode>34141</postCode>
									<settlement>Daejeon</settlement>
									<country key="KR">South Korea</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Seunghee</forename><surname>Koh</surname></persName>
							<email>seunghee1215@kaist.ac.kr</email>
							<affiliation key="aff0">
								<orgName type="institution">Korea Advanced Institute of Science and Technology (KAIST)</orgName>
								<address>
									<addrLine>291 Daehak-ro, Yuseong-gu</addrLine>
									<postCode>34141</postCode>
									<settlement>Daejeon</settlement>
									<country key="KR">South Korea</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yunho</forename><surname>Jeon</surname></persName>
							<email>yhjeon@hanbat.ac.kr</email>
							<affiliation key="aff1">
								<orgName type="institution">Hanbat National University</orgName>
								<address>
									<addrLine>125, Dongseo-daero, Yuseong-gu</addrLine>
									<postCode>34158</postCode>
									<settlement>Daejeon</settlement>
									<country key="KR">South Korea</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Junmo</forename><surname>Kim</surname></persName>
							<email>junmo.kim@kaist.ac.kr</email>
							<affiliation key="aff0">
								<orgName type="institution">Korea Advanced Institute of Science and Technology (KAIST)</orgName>
								<address>
									<addrLine>291 Daehak-ro, Yuseong-gu</addrLine>
									<postCode>34141</postCode>
									<settlement>Daejeon</settlement>
									<country key="KR">South Korea</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Neural Vicinal Risk Minimization: Noise-robust Distillation for Noisy Labels</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">F453CE5282FE354C135F8BB7669DE43F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:21+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Learning with Label Noise</term>
					<term>Vicinal Risk Minimization</term>
					<term>Noise-robust Loss</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Training deep neural networks under noisy supervision remains a challenging problem in weakly supervised learning. Mislabeled instances can severely degrade the generalization of classification models to unseen data. In this paper, we propose a novel regularization method, coined Noise-robust Distillation (NRD), that addresses robust training under noisy supervision. NRD is motivated by a novel learning framework, which we name Neural Vicinal Risk (NVR) minimization, that improves the estimation quality of the data distribution and handles label noise effectively. Our framework is based on the observation that a neural network retains the capability to correctly classify data sampled from the vicinal distribution even when the model is overfitted to noisy labels. By ensembling the predictions over the neural vicinal distribution, we obtain an accurate estimate of the class probabilities that reflects sample-wise class ambiguity. We validate our method on various noisy label benchmarks and demonstrate significant improvements in robustness to label noise.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Deep learning models have achieved remarkable success in various domains, including image classification, natural language processing, and speech recognition. However, the performance of these models relies heavily on the availability of high-quality labeled data for training. Obtaining accurately annotated labels can be a challenging and time-consuming task, often requiring human annotators to manually label large amounts of data. As a result, noisy labels may arise during the annotation process, leading to suboptimal model performance.</p><p>In this paper, we address noisy label learning as a subset of a more generic class of problems. This encompasses learning from an over-confident target probability distribution and image ambiguity <ref type="bibr" target="#b0">[1]</ref>, human annotation errors, multiple classes in an image, and out-of-distribution training examples <ref type="bibr" target="#b1">[2]</ref> that can naturally occur due to, for example, random-crop data augmentation. We show that our generic noisy label supervision algorithm can address a combination of these issues using a simple and unified approach.</p><p>We propose a noise-robust learning algorithm named Noise-Robust Distillation (NRD) to address the issue of noisy supervision during training. NRD aims to improve the generalization performance of classification models by explicitly considering the noise and ambiguity in the training labels. We motivate NRD by a novel formulation of the noisy supervision learning problem, which we name Neural Vicinal Risk (NVR) minimization.</p><p>This stems from the observation that deep neural networks have the inherent capability to detect and correct noisy supervision, even when trained on noisy labels. This ability is particularly evident when considering the vicinal distribution, the distribution generated from perturbed versions of the training data. 
Despite being trained on noisy labels, neural networks can still accurately model the vicinal distribution, indicating their potential to correct the noisy supervision.</p><p>Our findings suggest that the combination of perturbation-based estimation and ensembling can lead to improved model performance, even in the presence of noisy supervision. Building on these insights, we propose Noise-Robust Distillation (NRD), a noise-robust learning method that leverages the neural vicinal risk principle to enhance the generalization performance of classification models trained on noisy labels.</p><p>The main contributions of this work are as follows:</p><p>• We introduce Noise-Robust Distillation (NRD), a noise-robust learning approach that comprehensively addresses the challenges posed by noisy supervision during training. • NRD is motivated by a novel noise-robust learning framework which we name Neural Vicinal Risk (NVR) minimization. We show that NVR improves the estimation quality of the true class distribution and handles label noise effectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>Noisy label learning Numerous methods tackle the challenge of training Deep Neural Networks (DNNs) on datasets that contain a mix of correctly labeled and mislabeled samples, as discussed in <ref type="bibr" target="#b3">[4]</ref>. Some approaches focus on designing a noise-robust loss to mitigate the impact of mislabeled samples. The Mean Absolute Error (MAE) loss <ref type="bibr" target="#b4">[5]</ref> demonstrates competitive performance. Following this, the Generalized Cross-Entropy (GCE), Symmetric Cross-Entropy (SCE), and active-passive losses were proposed with improved noise-robustness. Generalized Jensen-Shannon divergence (GJS) <ref type="bibr" target="#b5">[6]</ref> enforces consistency between predictions from multiple augmented views of a sample to regularize training. The principle of negative learning is emphasized by <ref type="bibr" target="#b6">[7,</ref><ref type="bibr" target="#b7">8]</ref>. Strategies inspired by the training dynamics of models <ref type="bibr" target="#b8">[9]</ref>, such as early stopping <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref> or over-parameterization <ref type="bibr" target="#b11">[12]</ref>, exploit the different convergence speeds of clean and noisy samples. Co-teaching <ref type="bibr" target="#b12">[13]</ref> trains two DNNs simultaneously, where each network learns from the clean samples chosen by its counterpart. Noise identification aims to filter noisy samples from the training dataset. Noisy samples can be filtered by measuring the degree of disagreement between ensemble models, which arises once a model overfits to the noisy samples. 
Recent algorithms <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref> utilize the power of Semi-Supervised Learning (SSL) through a two-step process: first filtering out noisy labels, then treating the detected noisy samples as unlabeled, reducing the noisy-label learning problem to an SSL task. Among SSL techniques, consistency regularization promotes consistent model outputs across data augmentations, as in the Π-model, Temporal Ensembling <ref type="bibr" target="#b16">[17]</ref>, and Mean Teacher <ref type="bibr" target="#b17">[18]</ref>. FixMatch <ref type="bibr" target="#b18">[19]</ref> integrates pseudo-labeling with consistency regularization, while virtual adversarial training <ref type="bibr" target="#b19">[20]</ref> utilizes adversarial attacks. MixMatch <ref type="bibr" target="#b20">[21]</ref>, adopted by DivideMix <ref type="bibr" target="#b13">[14]</ref>, generates sharpened pseudo-labels for data-augmented unlabeled examples and mixes labeled and unlabeled data using MixUp <ref type="bibr" target="#b21">[22]</ref>.</p><p>Calibration and knowledge distillation Confidence calibration <ref type="bibr" target="#b22">[23]</ref> is the process of adjusting a model's predicted probabilities to better reflect the true likelihood of correctness. Training a model with data augmentation such as MixUp <ref type="bibr" target="#b21">[22]</ref> has been shown to improve model calibration and robustness to noise <ref type="bibr" target="#b23">[24]</ref>. Meanwhile, Knowledge Distillation (KD) <ref type="bibr" target="#b24">[25]</ref> enhances a student model by transferring knowledge contained in the predictions of a teacher model, focusing on "dark" or "hidden" knowledge, including both its confident and less confident predictions. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Preliminaries</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Empirical Risk</head><p>The expected risk 𝑅(𝜃) is defined as the average loss over 𝑝(𝑥, 𝑦),</p><formula xml:id="formula_0">𝑅(𝜃) = ∫︁ 𝑥,𝑦 ℓ(𝑥, 𝑦, 𝜃)𝑝(𝑥, 𝑦) 𝑑𝑥𝑑𝑦 .<label>(1)</label></formula><p>In practice, a dataset 𝒟 is used to mimic the true distribution 𝑝(𝑥, 𝑦), which leads to the empirical risk</p><formula xml:id="formula_1">𝑅 ˆ(𝜃) = ∫︁ 𝑥,𝑦 ℓ(𝑥, 𝑦, 𝜃)𝑝 ˆ(𝑥, 𝑦) 𝑑𝑥𝑑𝑦 ,<label>(2)</label></formula><p>where the corresponding empirical distribution 𝑝 ˆ(𝑥, 𝑦) is a mixture of delta masses at the observed samples, and the class distribution is the one-hot distribution given by the annotations,</p><formula xml:id="formula_2">𝑝 ˆ(𝑥, 𝑦) = 1 𝑛 𝑛 ∑︁ 𝑖=1 1 {𝑦=𝑦 𝑖 } 𝛿(𝑥 − 𝑥𝑖) .<label>(3)</label></formula><p>Our goal is to refine the estimation of the data distribution 𝑝(𝑥, 𝑦) by utilizing the empirical distribution 𝑝 ˆ(𝑥, 𝑦). A pivotal question is how to enhance the approximation of the true risk 𝑅(𝜃) intrinsic to a classification model. As evidenced by Equation (<ref type="formula" target="#formula_2">3</ref>), this task necessitates the accurate estimation of two orthogonal components of the true distribution 𝑝(𝑥, 𝑦) = 𝑃 (𝑦|𝑥)𝑝(𝑥):</p><p>(1) the input distribution 𝑝(𝑥) and (2) the corresponding conditional distribution 𝑃 (𝑦|𝑥).</p></div>
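To make Equations (1)–(3) concrete, here is a minimal numpy sketch (function and variable names are ours, not the paper's) showing how the empirical risk collapses to a mean of per-sample losses under one-hot labels:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def empirical_risk(logits, labels):
    """Eqs. (2)-(3): with the empirical distribution a mixture of delta
    masses and one-hot class labels, the risk integral collapses to a
    plain mean of per-sample losses -log P(y_i | x_i)."""
    probs = softmax(logits)
    n = len(labels)
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

# Toy check: confident, correct logits give near-zero empirical risk.
logits = np.array([[10.0, 0.0], [0.0, 10.0]])
labels = np.array([0, 1])
risk = empirical_risk(logits, labels)
```

For uniform logits the risk reduces to log of the number of classes, as expected for cross-entropy.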
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Neural Empirical Risk</head><p>Estimating 𝑃 (𝑦|𝑥) as a one-hot distribution involves assigning a single class label per sample, which is vulnerable to human annotation errors. Unfortunately, it proves challenging to enhance or secure accurate supervision signals for 𝑃 (𝑦|𝑥), as this requires multiple human annotators reviewing the same image <ref type="bibr" target="#b0">[1]</ref>, which is a prohibitively costly process. Nonetheless, enhancing the estimation quality of the true class distribution 𝑃 (𝑦|𝑥) can lead to further improvements in estimating and minimizing the true risk.</p><p>Neural Empirical Risk (NER) Instead of using Equation (3), we can choose to parameterize 𝑃 (𝑦|𝑥) by a neural network 𝑃 (𝑦|𝑥, 𝜑) to further improve the estimation quality. First, we factorize the data distribution as 𝑝(𝑥, 𝑦) = 𝑃 (𝑦|𝑥)𝑝(𝑥), and denote the corresponding empirical distributions as follows:</p><formula xml:id="formula_3">𝑝 ˆ(𝑥) = 1 𝑛 𝑛 ∑︁ 𝑖=1 𝛿(𝑥 − 𝑥𝑖) (4) 𝑃 ˆ(𝑦|𝑥𝑖) = 1 {𝑦=𝑦 𝑖 } .<label>(5)</label></formula><p>Instead of using 𝑃 ˆ(𝑦|𝑥), we choose to use a distribution parameterized by a neural network trained on 𝒟,</p><formula xml:id="formula_4">𝑃 (𝑦|𝑥, 𝒟) = ∫︁ 𝜑 𝑃 (𝑦|𝑥, 𝜑)𝑝(𝜑|𝒟) 𝑑𝜑 ,<label>(6)</label></formula><p>where 𝑝(𝜑|𝒟) is the distribution over the function class parameterized by the neural network. By plugging Equation (<ref type="formula" target="#formula_4">6</ref>) into 𝑝 ˆ(𝑥, 𝑦) = 𝑃 ˆ(𝑦|𝑥)𝑝 ˆ(𝑥), we define the neural empirical distribution 𝑝 ˆ𝜌 and the neural empirical risk 𝑅 ˆ𝜌 as</p><formula xml:id="formula_5">𝑝 ˆ𝜌(𝑥, 𝑦|𝒟) = 𝑃 (𝑦|𝑥, 𝒟)𝑝 ˆ(𝑥) (7) 𝑅 ˆ𝜌(𝜑) = ∫︁ 𝑥,𝑦 ℓ(𝑥, 𝑦, 𝜑)𝑝 ˆ𝜌(𝑥, 𝑦|𝒟) 𝑑𝑥𝑑𝑦 . (8)</formula><p>Here, we refer to the model 𝑃 (𝑦|𝑥, 𝜑) as the teacher network to distinguish it from the model being trained, a term borrowed from knowledge distillation. 
This can provide better estimation quality than 𝑃 ˆ(𝑦|𝑥), as is often observed in knowledge distillation, which we view as an instance of NER minimization. Knowledge distillation is known to improve generalization and calibration performance due to dark knowledge <ref type="bibr" target="#b24">[25]</ref>. However, when the model is overfitted to noisy labels, its estimates of the class probabilities degrade severely. Hence, to effectively utilize a neural network, it is necessary to employ a noise-robust method that accurately estimates the class probabilities in the presence of noisy labels.</p></div>
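A hedged sketch of NER minimization with 𝑝(𝜑|𝒟) collapsed to a single point estimate 𝜑*: the one-hot targets of Equation (5) are replaced by the teacher's soft class distribution, as in knowledge distillation. All names here are illustrative, not from the paper:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def neural_empirical_risk(student_logits, teacher_probs):
    """Eq. (8) with p(phi|D) collapsed to a point estimate phi*:
    cross-entropy of the student against the teacher's soft distribution
    P(y|x, phi*), i.e. knowledge distillation viewed as NER minimization."""
    log_p = np.log(softmax(student_logits) + 1e-12)
    # Soft targets: sum over classes instead of indexing a one-hot label.
    return -(teacher_probs * log_p).sum(axis=-1).mean()

# A teacher spreads mass over ambiguous classes instead of a one-hot spike.
teacher_probs = np.array([[0.7, 0.2, 0.1]])
student_logits = np.log(teacher_probs)   # a student matching the teacher exactly
risk = neural_empirical_risk(student_logits, teacher_probs)
# The minimum of this cross-entropy is the teacher's own entropy.
teacher_entropy = -(teacher_probs * np.log(teacher_probs)).sum(axis=-1).mean()
```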
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Vicinal risk for noise-robust learning</head><p>Our motivation is based on the Vicinal Risk Minimization (VRM) principle <ref type="bibr" target="#b25">[26]</ref>, which provides an alternative approximation to 𝑝(𝑥, 𝑦). The vicinal distribution 𝑝𝜈 (𝑥 ˜, 𝑦 ˜) constructed from the data distribution is defined as</p><formula xml:id="formula_6">𝑝𝜈 (𝑥 ˜, 𝑦 ˜) = ∫︁ 𝑥,𝑦 𝜈(𝑥 ˜, 𝑦 ˜|𝑥, 𝑦)𝑝(𝑥, 𝑦) 𝑑𝑥𝑑𝑦 ,<label>(9)</label></formula><p>where 𝜈(𝑥 ˜, 𝑦 ˜|𝑥, 𝑦) is the vicinity distribution around (𝑥, 𝑦). For example, <ref type="bibr" target="#b25">[26]</ref> used additive Gaussian noise 𝒩 (0, 𝜎 2 𝐼).</p><p>MixUp <ref type="bibr" target="#b23">[24]</ref> and CutMix <ref type="bibr" target="#b26">[27]</ref> chose stochastic interpolation between samples, which has also shown effectiveness under label noise <ref type="bibr" target="#b23">[24]</ref>. Using the dataset, Equation (<ref type="formula" target="#formula_6">9</ref>) is replaced by the empirical distribution as</p><formula xml:id="formula_8">𝑝 ˆ𝜈 (𝑥 ˜, 𝑦 ˜) = ∫︁ 𝑥,𝑦 𝜈(𝑥 ˜, 𝑦 ˜|𝑥, 𝑦)𝑝 ˆ(𝑥, 𝑦) 𝑑𝑥𝑑𝑦 (10) = 1 𝑛 𝑛 ∑︁ 𝑖=1 𝜈(𝑥 ˜, 𝑦 ˜|𝑥𝑖, 𝑦𝑖) .<label>(11)</label></formula></div>
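The two vicinity choices mentioned above can be sketched as samplers for Equation (11); the functions below are our illustrative stand-ins, not reference implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian_vicinity(x, y, sigma=0.1):
    """Eq. (11) with nu chosen as additive Gaussian noise N(0, sigma^2 I) [26]:
    pick a training pair uniformly and perturb the input; the label is kept."""
    i = rng.integers(len(x))
    return x[i] + rng.normal(0.0, sigma, size=x[i].shape), y[i]

def sample_mixup_vicinity(x, y_onehot, alpha=0.2):
    """MixUp-style vicinity: stochastic interpolation between two training
    pairs, mixing both the inputs and the (one-hot) labels."""
    i, j = rng.integers(len(x), size=2)
    lam = rng.beta(alpha, alpha)
    return lam * x[i] + (1 - lam) * x[j], lam * y_onehot[i] + (1 - lam) * y_onehot[j]

x = rng.normal(size=(8, 4))           # 8 toy inputs with 4 features
y = rng.integers(3, size=8)           # 3 classes
y_onehot = np.eye(3)[y]
x_g, y_g = sample_gaussian_vicinity(x, y)
x_m, y_m = sample_mixup_vicinity(x, y_onehot)
```

Note that the Gaussian vicinity leaves labels untouched, while the MixUp vicinity produces soft labels on the simplex.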
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Neural Vicinal Risk (NVR)</head><p>We propose to further improve this approximation by using a neural network to robustly estimate the data distribution, modifying Equation <ref type="formula" target="#formula_6">(9)</ref>. We propose the following approximate vicinal data distribution parameterized by a deep neural network 𝜑, which we name the neural vicinal distribution 𝑝𝜋. During training, the predictions from the original views (student augmentation) are regularized using the predictions generated from unseen views (teacher augmentation). We use an asymmetric augmentation policy so that the teacher augmentation generates novel views, and the stop-gradient operation ensures that the model does not memorize the views generated by the teacher augmentation.</p><formula xml:id="formula_9">𝑝𝜋(𝑥 ˜, 𝑦 ˜|𝒟) = 𝑃 (𝑦 ˜|𝑥 ˜; 𝒟)𝑝(𝑥 ˜) (12) = ∫︁ 𝜑 𝑃 (𝑦 ˜|𝑥 ˜, 𝜑)𝑑𝑝(𝜑|𝒟) ∫︁ 𝑥 𝜈(𝑥 ˜|𝑥)𝑑𝑝(𝑥)<label>(13)</label></formula><formula xml:id="formula_10">≈ ∫︁ 𝜑 𝑃 (𝑦 ˜|𝑥 ˜, 𝜑)𝑑𝛿(𝜑 − 𝜑 * ) ∫︁ 𝑥 𝜈(𝑥 ˜|𝑥)𝑑𝑝 ^(𝑥) (14) = 𝑃 (𝑦 ˜|𝑥 ˜, 𝜑 * ) 1 𝑛 𝑛 ∑︁ 𝑖=1 𝜈(𝑥 ˜|𝑥 𝑖 )<label>(15)</label></formula></div>
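Equation (15) can be sketched as follows, with the integral over 𝜑 collapsed to a point estimate and the expectation over 𝜈(𝑥˜|𝑥) approximated by Monte Carlo averaging over teacher-augmented views; the linear "teacher" and Gaussian "augmentation" are toy stand-ins of our own:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def neural_vicinal_target(x, teacher, augment, n_views=16):
    """Eq. (15): estimate P(y~|x~) by averaging the teacher's softmax over
    views drawn from the teacher augmentation nu(x~|x)."""
    views = np.stack([augment(x) for _ in range(n_views)])
    return softmax(teacher(views)).mean(axis=0)

# Toy stand-ins (ours, not the paper's): a linear "teacher" and Gaussian
# perturbation playing the role of the teacher augmentation.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
teacher = lambda v: v @ W
augment = lambda x: x + rng.normal(0.0, 0.05, size=x.shape)

x = rng.normal(size=4)
p = neural_vicinal_target(x, teacher, augment)
```

The result is a proper distribution over classes that reflects how the teacher's prediction varies in the vicinity of x.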
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Label correction behavior for memorized training examples using transformed views. The models are trained to perfectly memorize the noisy labels, then evaluated on the training set against the ground-truth labels. Due to memorization, the GT accuracy for mislabeled instances is zero and the overall accuracy is bounded by the noise rate. However, averaging the predictions from the transformed inputs shifts the predictions of the noisy examples towards the ground truth. For the transformation, AutoAugment followed by RandomErasing was used. For the dataset, NoisyCIFAR-10 with symmetric noise was used. Note that Equation ( <ref type="formula">17</ref>) is a parameterized version of Equation (11) using a deep neural network. Finally, the neural vicinal risk is</p><formula xml:id="formula_11">𝑅 ˆ𝜋(𝜑) = ∫︁ 𝑥,𝑦 ℓ(𝑥, 𝑦, 𝜑)𝑝 ˆ𝜋(𝑥 ˜, 𝑦 ˜; 𝜑 * ) 𝑑𝑥𝑑𝑦 .<label>(18)</label></formula><p>Note that 𝜈(𝑥 ˜|𝑥) is distinct from the augmentation strategy applied to the model being trained. Similar to Equation ( <ref type="formula">8</ref>), we refer to 𝜈(𝑥 ˜|𝑥) as the teacher augmentation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Self-correction for memorized instances</head><p>We further discuss the behavior of the neural vicinal distribution over a noisy training dataset. Notably, when a training dataset includes mislabeled instances, a teacher neural network can overfit to these noisy labels, a setting where neural empirical risk minimization fails to mitigate their impact. Interestingly, we observe that the neural vicinal distribution exhibits robustness against label noise, effectively self-correcting incoherent labels within the training set.</p><p>To understand this phenomenon, we visualize the behavior of the neural vicinal distribution in Figure <ref type="figure">2a</ref>. Here, we compare the softmax scores from the augmented input samples, distinguishing between the neural empirical distribution (red markers) and the neural vicinal distribution (blue markers). For the transformation policy, the network was trained using random crop augmentation, and AutoAugment <ref type="bibr" target="#b27">[28]</ref> was chosen as the vicinal distribution to generate the novel views. The top row shows clean instances and the bottom row shows mislabeled instances.</p><p>The visual analysis contrasts the softmax predictions from both seen and novel views of clean and mislabeled training instances. The self-correction of the neural vicinal distribution is instance-dependent, responding differently based on whether an instance is clean or mislabeled. Notably, while the teacher network's predictions for the novel views tend to shift misclassified predictions towards the ground truth, they remain consistent for clean samples. This suggests that the network outputs corrected predictions by dissociating the novel views from the memorized views.</p><p>Next, in Figure <ref type="figure" target="#fig_1">1</ref>, we analyze the label correction behavior of the neural vicinal distribution over the dataset population. 
Note that the models are trained only on the noisy training set, without access to the ground-truth labels. Applying the transformation (blue curve) significantly reduces the GT-class cross-entropy loss compared to no transformation (red curve), and we observe a good separation between the two distributions. Table <ref type="table">1</ref> also shows the ground-truth accuracy for the training samples, where we observe significant improvements for the mislabeled instances when the transformation is applied.</p><p>We additionally observed that ensembling perturbed predictions enhances calibration, as depicted in Figure <ref type="figure">2b</ref>. While the original model is heavily over-confident due to overfitting (red), the vicinal prediction (blue) improves accuracy and reflects class ambiguities.</p></div>
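The self-correction effect can be illustrated with a deliberately contrived toy (ours, not the paper's experiment): a mock model that returns memorized noisy labels on exactly-seen views but generalizes on novel views, mirroring the evaluation protocol behind Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): two class prototypes, inputs near their GT
# prototype, and 40% symmetric label noise.
protos = np.array([[0.0, 0.0], [4.0, 4.0]])
gt = rng.integers(2, size=50)
x = protos[gt] + rng.normal(0.0, 0.3, size=(50, 2))
noisy = gt.copy()
flip = rng.random(50) < 0.4
noisy[flip] = 1 - noisy[flip]

def predict(queries):
    """Mock 'memorizing' model: returns the (possibly noisy) memorized label
    on an exactly-seen training view, and generalizes via the nearest
    prototype on any novel view."""
    out = np.empty(len(queries), dtype=int)
    for k, q in enumerate(queries):
        hit = np.where((x == q).all(axis=1))[0]
        if hit.size:                                   # memorized view
            out[k] = noisy[hit[0]]
        else:                                          # novel view
            out[k] = ((protos - q) ** 2).sum(axis=1).argmin()
    return out

acc_seen = (predict(x) == gt).mean()                # bounded by 1 - noise rate
x_novel = x + rng.normal(0.0, 0.05, size=x.shape)   # transformed views
acc_novel = (predict(x_novel) == gt).mean()         # self-corrected
```

On seen views, accuracy equals the clean-label fraction (memorization); on transformed views the mock model recovers the ground truth, echoing the label-correction behavior described above.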
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Method</head><p>Motivated by the observation in Section 3.5, we propose a novel learning method for noisy labels named Noise-robust Distillation (NRD). Our method is formulated as a simple loss function, which makes it easy to employ in existing training pipelines.</p><p>For this, we combine the target loss with the neural vicinal risk loss as a regularization objective, formulating the combined objectives as a triplet loss. We found the Jensen-Shannon divergence (JSD), which naturally generalizes to three distributions, to be effective. The JSD for three distributions is</p><formula xml:id="formula_12">JSD 𝜋 (p1, p2, p3) = ∑︁ 𝑖 𝜋𝑖𝐷KL (p i ||m) ,<label>(19)</label></formula><p>where m = ∑︀ 𝑖 𝜋𝑖p i . The hyperparameter 𝜋 ∈ ∆ 2 is chosen to balance the importance weights between the distributions. Additionally, the JS divergence is known to have a desirable robustness property against label noise: <ref type="bibr" target="#b5">[6]</ref> showed that the JS divergence approaches the MAE loss <ref type="bibr" target="#b4">[5]</ref> in its asymptote.</p><p>Next, we derive our NRD objective step by step. By applying NVR to the JSD loss, we have</p><formula xml:id="formula_13">ℒ(𝜃; 𝑥, y, 𝜑) = JSD 𝜋 (y, ys, y𝑡)<label>(20)</label></formula><formula xml:id="formula_14">ys = 𝑓 (𝑥, 𝜃)<label>(21)</label></formula><formula xml:id="formula_15">yt = E 𝜈(𝑥 ˜|𝑥) [𝑓 (𝑥 ˜, 𝜑)] ,<label>(22)</label></formula><p>assuming that we have a trained teacher network 𝜑. Here, y is the target label, ys is the model output, and yt is the teacher network output. The loss is minimized over 𝜃, i.e., min 𝜃 ℒ(y, ys, yt).</p><p>To improve noise-robustness, we can further employ an iterative distillation scheme in which we repeat the strategy for multiple rounds of training. We set the teacher network as the model obtained from the previous training round, such that 𝜑𝑡 = 𝜃𝑡−1 at the 𝑡-th training round. 
Applying this scheme to Equation ( <ref type="formula" target="#formula_13">20</ref>), a student network obtained from the previous training round is switched to the teacher role for the next round.</p><p>However, in practice, we found this to be unstable and difficult to converge. Instead, we take the exponential moving average of the historical models as the teacher and set 𝜑𝑡 = 𝜃 ¯𝑡−1,</p><formula xml:id="formula_17">𝜃 ¯𝑡 = 𝛽 • 𝜃 ¯𝑡−1 + (1 − 𝛽) • 𝜃𝑡 .<label>(24)</label></formula><p>For the decay rate, we simply set 𝛽 = 0.99 for all experiments. The averaging reduces the variance of the neural vicinal risk estimation caused by stochastic gradients, and we have empirically found that it effectively stabilizes training and leads to faster convergence. </p><formula xml:id="formula_20">yt = E 𝜈(𝑥 ˜|𝑥) [︀ 𝑓 (𝑥 ˜, 𝜃 ¯)]︀ ,<label>(27)</label></formula><p>with a slight abuse of notation for 𝜃 ¯, which is not an optimization variable but is continuously updated after each SGD step. This is implemented by detaching yt from the backpropagation graph, which prevents the model from memorizing the teacher augmentation views (stop-grad in Figure <ref type="figure" target="#fig_4">3</ref>). For Equation ( <ref type="formula" target="#formula_20">27</ref>), we found a single sample per SGD step to be sufficient. The overall architecture is illustrated in Figure <ref type="figure" target="#fig_4">3</ref> and the pseudocode is presented in Algorithm 1.</p></div>
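A minimal numpy sketch of the resulting NRD objective, combining the triplet JSD of Equation (19), the view-averaged teacher target, and the EMA teacher update of Equation (24); in an autograd framework the teacher target would additionally be detached (stop-grad), which plain numpy makes implicit. Names and defaults are ours:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q):
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)

def jsd3(ps, pi):
    """Eq. (19): Jensen-Shannon divergence over three distributions with
    importance weights pi on the simplex."""
    m = sum(w * p for w, p in zip(pi, ps))
    return sum(w * kl(p, m) for w, p in zip(pi, ps)).mean()

def nrd_loss(y_onehot, student_logits, teacher_view_logits, pi=(1/3, 1/3, 1/3)):
    """Eqs. (20)-(22): triplet JSD between the target label y, the student
    output y_s, and the teacher output y_t averaged over teacher-augmented
    views. The uniform pi default is an assumption, not the paper's setting."""
    y_s = softmax(student_logits)
    y_t = softmax(teacher_view_logits).mean(axis=0)   # E over nu(x~|x)
    return jsd3([y_onehot, y_s, y_t], pi)

def ema_update(theta_bar, theta, beta=0.99):
    """Eq. (24): exponential moving average teacher with beta = 0.99."""
    return beta * theta_bar + (1 - beta) * theta

# Toy check: when label, student, and teacher nearly agree, the loss is small.
y = np.array([[0.0, 1.0]])
s_logits = np.array([[0.0, 5.0]])
t_logits = np.stack([s_logits, s_logits])   # two teacher views
loss = nrd_loss(y, s_logits, t_logits)
```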
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experimental settings</head><p>Benchmarking datasets For synthetic label noise benchmarks, we used NoisyCIFAR-10 and NoisyCIFAR-100 <ref type="bibr" target="#b28">[29]</ref>. For symmetric label noise, we randomly flip the ground-truth label with probability 𝜂, uniformly across all categories.</p><p>For asymmetric label noise, we follow the scheme in <ref type="bibr" target="#b29">[30]</ref>.</p><p>For NoisyCIFAR-10-asymm, we flip truck→automobile, bird→airplane, cat→dog, dog→cat, and deer→horse. For NoisyCIFAR-100-asymm, within each superclass, we randomly replace a subclass label 𝑦𝑖 with the adjacent subclass 𝑦𝑖 + 1 with probability 𝜂.</p><p>For the real-world benchmark, we used the WebVision <ref type="bibr" target="#b30">[31]</ref> dataset. WebVision consists of 2.4M training examples collected via Google and Flickr image search. We used a miniaturized training set following <ref type="bibr" target="#b31">[32]</ref>, which uses only the first 50 categories in the "Google" image set. Mini-WebVision consists of 66K training and 2.5K validation examples. We additionally evaluated the trained model on the ImageNet <ref type="bibr" target="#b32">[33]</ref> validation set. The noise rate is known to be around 20%. 
Baseline methods For the CIFAR benchmarks, we compare against cross-entropy (CE), bootstrapping (BS) <ref type="bibr" target="#b33">[34]</ref>, label smoothing (LS) <ref type="bibr" target="#b34">[35]</ref>, symmetric cross-entropy (SCE) <ref type="bibr" target="#b35">[36]</ref>, generalized cross-entropy (GCE) <ref type="bibr" target="#b36">[37]</ref>, normalized loss (NCE+RCE) <ref type="bibr" target="#b37">[38]</ref>, and Jensen-Shannon divergence (JS, GJS) <ref type="bibr" target="#b5">[6]</ref>.</p><p>For the WebVision benchmarks, we compared our method with state-of-the-art methods including ELR+ <ref type="bibr" target="#b9">[10]</ref>, DivideMix <ref type="bibr" target="#b13">[14]</ref>, and GJS <ref type="bibr" target="#b5">[6]</ref>. The baseline results were adopted from <ref type="bibr" target="#b5">[6]</ref>. Models The PreActResNet-34 architecture <ref type="bibr" target="#b38">[39]</ref> is used for all experiments conducted on the CIFAR-10/100 datasets. For WebVision experiments, we used ResNet-50. All models were trained from random initialization. Augmentation policy For the CIFAR experiments, we followed <ref type="bibr" target="#b5">[6]</ref> and used RandAugment <ref type="bibr" target="#b39">[40]</ref> chained with Cutout <ref type="bibr" target="#b40">[41]</ref> for all methods. For the NRD teacher transformation, we used AugMix <ref type="bibr" target="#b41">[42]</ref> in all experiments. Hyperparameters For the CIFAR-10/100 benchmarks, we trained for 400 epochs. We used the SGD optimizer with momentum 0.9 and weight decay of 10 −4 . Learning rates were reduced by a factor of 0.1 after the 200th and 300th epochs.</p><p>For the WebVision benchmarks, we trained the network for 300 epochs. The learning rate was reduced by a factor of 0.1 after the 150th and 250th epochs. Refer to Appendix A for hyperparameter configuration details.</p></div>
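The label-noise injection described in Section 5.1 can be sketched as follows; the uniform-flip convention for symmetric noise is one common reading (some implementations exclude the original class), so treat this as an assumption rather than the exact benchmark code:

```python
import numpy as np

def symmetric_noise(labels, eta, num_classes, rng):
    """Symmetric noise: with probability eta, replace the GT label by a class
    drawn uniformly from all categories (one common convention; some
    implementations exclude the original class)."""
    labels = labels.copy()
    flip = rng.random(len(labels)) < eta
    labels[flip] = rng.integers(num_classes, size=int(flip.sum()))
    return labels

def asymmetric_noise_cifar100(labels, eta, rng, per_super=5):
    """NoisyCIFAR-100-asymm as described: within each superclass of 5
    subclasses, move label y_i to the adjacent subclass with probability eta."""
    labels = labels.copy()
    flip = rng.random(len(labels)) < eta
    base = (labels // per_super) * per_super
    labels[flip] = base[flip] + (labels[flip] - base[flip] + 1) % per_super
    return labels

rng = np.random.default_rng(0)
y = rng.integers(100, size=10000)
y_sym = symmetric_noise(y, 0.4, 100, rng)
y_asym = asymmetric_noise_cifar100(y, 0.4, rng)
```

Note that the asymmetric variant always stays within the original superclass, making the noise structured rather than uniform.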
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Results</head><p>Performance on noisy label benchmarks In Table <ref type="table" target="#tab_2">2</ref>, we show the performance of our method in comparison to robust loss functions. While most of the baselines show inconsistent performance between symmetric and asymmetric noise types, our method shows consistent improvement across a wide range of noise rates and noise types. Notably, we significantly improve performance under high-noise-rate settings where GJS tends to underperform. For NoisyCIFAR-10 with 80% noise, we improve by 5%p over SCE, and for NoisyCIFAR-100 with 80% noise, we improve by 10%p over GCE. Furthermore, the results on the large-scale real-world noisy label benchmark are shown in Table <ref type="table" target="#tab_3">3</ref>. Notably, our method outperforms existing methods that use two networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Performance on clean datasets</head><p>The proposed method improves model generalization when applied to clean dataset training, as seen in Table <ref type="table" target="#tab_4">4</ref>. This is because the training dataset contains visually ambiguous images that make it difficult to draw a clear decision boundary, and therefore the hard target distributions from the annotations serve as a type of noisy supervision signal. We show that applying NRD can regularize and improve the performance of the model.</p><p>Figure 4 (caption): Enforcing consistency does not fully prevent overfitting because the model memorizes the noisy labels after an extended number of epochs. In contrast, our method (NRD) effectively prevents memorization. The NoisyCIFAR-100 dataset is used.</p><p>Comparison to consistency regularization Consistency regularization, used in GJS, is a powerful technique for noise-robustness. Although similar to NRD, it does not directly prevent memorization of noisy labels. Figure <ref type="figure" target="#fig_8">4</ref> shows that GJS suffers from overfitting when trained for an extended number of steps: test accuracy decreases after reaching a peak at an early epoch. In contrast, NRD significantly mitigates overfitting. Notably, in the 80% noise rate setting, we improve over GJS by 36%p. The key contributing factor is that our method uses a stop-gradient, which directly prevents the model from memorizing the views generated by the asymmetric augmentation policy. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Confidence calibration</head><p>In Figure <ref type="figure">5</ref>, we additionally evaluated calibration performance. We observed that the regularization effect of NRD also improves the calibration of the model. Our method shows consistent calibration performance across all noise rates, which aligns with its classification performance.</p></div>
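Calibration is typically summarized by the Expected Calibration Error; a minimal sketch of the standard binned estimator <ref type="bibr" target="#b22">[23]</ref>, with toy inputs of our own:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Standard binned ECE sketch: group predictions by confidence and
    average the |accuracy - confidence| gap per bin, weighted by bin size."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Perfectly calibrated toy: 70%-confident predictions, right 70% of the time.
conf = np.full(1000, 0.7)
correct = np.zeros(1000)
correct[:700] = 1.0
ece_calibrated = expected_calibration_error(conf, correct)

# Over-confident toy: 90%-confident predictions, right only 60% of the time.
correct_bad = np.zeros(1000)
correct_bad[:600] = 1.0
ece_overconfident = expected_calibration_error(np.full(1000, 0.9), correct_bad)
```

An over-confident model, like the overfitted baseline in Figure 5, shows a large gap between confidence and accuracy.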
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>Our work proposes Noise-Robust Distillation (NRD), a simple regularization objective designed to address a wide range of noisy supervision problems in training. We motivate our method by a novel formulation, Neural Vicinal Risk (NVR) minimization, which leverages deep neural networks to improve empirical risk minimization under noisy supervision. A key insight of our work is the inherent capacity of deep neural networks to detect and correct mislabeled examples based on the vicinal distribution, a property we exploit to improve model predictions and calibration. We have validated our method on several noisy label learning benchmarks. The results show clear improvements in performance compared to the baselines under noisy supervision. These findings suggest that NRD offers an effective strategy for handling noisy supervision, leading to enhanced generalization performance of classification models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 6</head><p>Visualization of the model prediction over training. We randomly selected four distinct noisy samples from the training dataset, which correspond to the four rows. The model is trained using RandomCrop and tested using RandomCrop-perturbed inputs (red) and AutoAugment-perturbed inputs (blue). The leftmost column shows the predicted confidence of the perturbed inputs with respect to the ground-truth classes. The figures on the right-hand side visualize the softmax vectors projected onto a decagonal surface, analogous to Figure <ref type="figure">2a</ref>. In the early phase of training, both red and blue markers predict the ground-truth class. However, as training progresses and the model overfits to the noisy labels, the red markers predict the target label, whereas a significant portion of the blue markers predicts the ground-truth class. This shows that unseen perturbations of the input can produce a noise-robust learning signal for training. </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Averaging predictions over the novel views of a mislabeled training instance effectively mitigates memorization. The model is trained on the noisy training set and tested again using the training examples. The histogram shows the distribution of cross-entropy loss with respect to the GT labels. The red curve corresponds to the standard prediction, and the blue curve to ensembling over transformation views. The right side shows a training example with its original view vs. the transformed novel views. The corresponding loss is marked as "1" and "2" on the histogram.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>a2</label><figDesc>Figure 2: On Figure 2a, model predictions of noisy samples are more sensitive to perturbations of the input. NoisyCIFAR-10 dataset is used. Markers indicate the softmax scores predicted by the model trained using random crop augmentation. Red markers (+) show predictions generated using the same augmentation policy used during training, and the blue markers (•) are generated using an unseen, stronger augmentation policy. The ten-class softmax scores are visualized by projecting onto a decagon using Equiradial Projection [3]. On Figure 2b, while the model itself is heavily mis-calibrated (red bars), ensembling the predictions of the perturbed inputs significantly improves the calibration (blue bars).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>3.1. Notations</head><label>1</label><figDesc>Consider a DNN classification model parameterized by 𝜃 ∈ Θ as 𝑓(𝑥, 𝜃) : 𝒳 → ∆𝐶−1, which outputs a probability distribution 𝑃(𝑦|𝑥; 𝜃). The input space is defined as 𝒳 = R𝐻×𝑊×𝐶, where 𝐻, 𝑊, 𝐶 are the height, width, and number of color channels of the image data. ∆𝑘 denotes the 𝑘-simplex. The model takes an image input 𝑥 ∈ 𝒳 and predicts a categorical distribution over 𝒴 = {1, 2, ..., 𝐶}. We denote an image augmentation operation as 𝒯(𝑥) : 𝒳 → 𝒳, and the training dataset as 𝒟 = {(𝑥𝑖, 𝑦𝑖)}𝑖. The loss function is defined as ℓ(𝑥, 𝑦, 𝜃) : 𝒳 × 𝒴 × Θ → R. 𝛿(•) is the Dirac delta function and 1{•} is the indicator function.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Illustration of the proposed Noise-robust Distillation (NRD) architecture. 𝑥 is the input data to the neural network, and 𝑦 is the assigned target label. Red arrows show the gradient propagation path. During training, the predictions from the original views (student augmentation) are regularized using the predictions generated from unseen views (teacher augmentation). We use an asymmetric augmentation policy so that the teacher augmentation generates novel views, and the stop-gradient operation ensures that the model does not memorize the views generated from the teacher augmentation.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>𝜃𝑡 = arg min𝜃 ℒ(𝜃; 𝑥, y, 𝜃𝑡−1).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head></head><label></label><figDesc>Finally, we formally define our NRD training objective. To reduce the training cost, we simplify each training round into a single step of stochastic gradient descent (SGD). This simplifies the algorithm from a multi-stage process into a single-stage process, and significantly accelerates training. The NRD objective is ℒNRD(𝜃; 𝑥, y, 𝜃̄) = JSD𝜋(y, ys, yt) (25), where ys = 𝑓(𝑥, 𝜃).</figDesc></figure>
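As a concrete illustration of the objective, the weighted (generalized) Jensen-Shannon divergence JSD𝜋 among the label distribution y, the student prediction ys, and the teacher prediction yt can be sketched as below — a minimal NumPy version; the example weights are placeholders, not the paper's tuned values:

```python
import numpy as np

def generalized_jsd(dists, weights, eps=1e-12):
    """Weighted JSD: sum_i pi_i * KL(p_i || m), where m = sum_i pi_i * p_i."""
    dists = [np.asarray(p, dtype=float) for p in dists]
    weights = np.asarray(weights, dtype=float)
    # Mixture of the input distributions under the weights pi
    m = sum(w * p for w, p in zip(weights, dists))
    kl = lambda p, q: float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
    return sum(w * kl(p, m) for w, p in zip(weights, dists))
```

The divergence is zero when all three distributions agree, and grows as the student drifts away from the label and the teacher targets.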
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Algorithm 1</head><label>1</label><figDesc>PyTorch-style pseudocode
ema_model = ema(model)
optimizer = sgd_optimizer(model)
for x, y in dataloader:
    x_t = teacher_aug(x)
    x_s = student_aug(x)
    y_t = ema_model(x_t).detach()  # disconnect from backprop
    y_s = model(x_s)
    loss = js_div(y, y_s, y_t)  # divergence between predictions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_model.update()</figDesc></figure>
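The `ema_model.update()` step in Algorithm 1 maintains the teacher weights as an exponential moving average of the student weights. A minimal sketch of the update rule, using the decay rate 𝛽 = 0.99 reported in the appendix (the function name and list-of-floats representation are ours, for illustration):

```python
def ema_update(teacher_params, student_params, beta=0.99):
    """theta_bar <- beta * theta_bar + (1 - beta) * theta, applied per parameter."""
    return [beta * t + (1.0 - beta) * s
            for t, s in zip(teacher_params, student_params)]
```

Because the teacher is a slowly moving average of the student, its predictions on the teacher-augmented views change smoothly over training, which stabilizes the distillation targets.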
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Comparison of overfitting behavior in consistency regularization (GJS <ref type="bibr" target="#b5">[6]</ref>). Enforcing consistency does not fully prevent overfitting because the model memorizes the noisy labels after an extended number of epochs. In contrast, our method (NRD) effectively prevents memorization. NoisyCIFAR-100 dataset is used.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Comparison of CE baseline and NRD models trained on NoisyCIFAR-10 with symmetric label noise, where 𝜂 denotes the noise rate. Expected calibration error (ECE) is measured. NRD training consistently improves calibration across a range of noise rates.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Noisy label performance on synthetic noisy label benchmarks. We used NoisyCIFAR-10 and NoisyCIFAR-100 datasets. Values indicate clean test accuracy. All values are averaged over five independent runs. The best and second best results are highlighted in bold.</figDesc><table><row><cell></cell><cell cols="6">NoisyCIFAR-10</cell><cell cols="6">NoisyCIFAR-100</cell></row><row><cell></cell><cell cols="4">Symmetric</cell><cell cols="2">Asymmetric</cell><cell cols="4">Symmetric</cell><cell cols="2">Asymmetric</cell></row><row><cell>Noise rate</cell><cell>20%</cell><cell>40%</cell><cell>60%</cell><cell>80%</cell><cell>20%</cell><cell>40%</cell><cell>20%</cell><cell>40%</cell><cell>60%</cell><cell>80%</cell><cell>20%</cell><cell>40%</cell></row><row><cell>CE</cell><cell>91.63</cell><cell>87.74</cell><cell>81.99</cell><cell>66.51</cell><cell>92.77</cell><cell>87.12</cell><cell>65.74</cell><cell>55.77</cell><cell>44.42</cell><cell>10.74</cell><cell>66.85</cell><cell>49.45</cell></row><row><cell>BS</cell><cell>91.68</cell><cell>89.23</cell><cell>82.65</cell><cell>16.97</cell><cell>93.06</cell><cell>88.87</cell><cell>72.92</cell><cell>68.52</cell><cell>53.80</cell><cell>13.83</cell><cell>73.79</cell><cell>64.67</cell></row><row><cell>LS</cell><cell>93.51</cell><cell>89.90</cell><cell>83.96</cell><cell>67.35</cell><cell>92.94</cell><cell>88.10</cell><cell>74.88</cell><cell>68.41</cell><cell>54.58</cell><cell>26.98</cell><cell>73.17</cell><cell>57.20</cell></row><row><cell>SCE</cell><cell>94.29</cell><cell>92.72</cell><cell>89.26</cell><cell>80.68</cell><cell>93.48</cell><cell>84.98</cell><cell>74.21</cell><cell>68.23</cell><cell>59.28</cell><cell>26.80</cell><cell>70.86</cell><cell>51.12</cell></row><row><cell>GCE</cell><cell>94.24</cell><cell>92.82</cell><cell>89.37</cell><cell>79.19</cell><cell>92.83</cell><cell>87.00</cell><cell>75.02</cell><cell>71.54</cell><cell>65.21</cell><cell>49.68</cell><cell>72.13</cell><cell>51.50</cell></row><row><cell>NCE+RCE</cell><cell>94.27</cell><cell>92.03</cell><cell>87.30</cell><cell>77.89</cell><cell>93.87</cell><cell>86.83</cell><cell>72.39</cell><cell>68.79</cell><cell>62.18</cell><cell>31.63</cell><cell>71.35</cell><cell>57.80</cell></row><row><cell>JS</cell><cell>94.52</cell><cell>93.01</cell><cell>89.64</cell><cell>76.06</cell><cell>92.18</cell><cell>87.99</cell><cell>75.41</cell><cell>71.12</cell><cell>64.36</cell><cell>45.05</cell><cell>71.70</cell><cell>49.36</cell></row><row><cell>GJS</cell><cell>95.33</cell><cell>93.57</cell><cell>91.64</cell><cell>79.11</cell><cell>93.94</cell><cell>89.65</cell><cell>78.05</cell><cell>75.71</cell><cell>70.15</cell><cell>44.49</cell><cell>74.60</cell><cell>63.70</cell></row><row><cell>NRD (ours)</cell><cell>95.43</cell><cell>94.65</cell><cell>92.45</cell><cell>85.32</cell><cell>93.90</cell><cell>91.25</cell><cell>78.54</cell><cell>76.29</cell><cell>72.43</cell><cell>60.01</cell><cell>76.07</cell><cell>61.40</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Real-world noisy label benchmark on WebVision. The models are trained using the WebVision training set, and evaluated on the WebVision and ImageNet validation sets. The values indicate accuracy. IRNv2 and RN50 indicate Inception-ResNet-V2 and ResNet-50, respectively. 𝑁 indicates the number of networks used.</figDesc><table><row><cell></cell><cell></cell><cell></cell><cell></cell><cell cols="2">WebVision</cell><cell cols="2">ImageNet</cell></row><row><cell>Method</cell><cell>Arch.</cell><cell>Aug.</cell><cell>𝑁</cell><cell>Top-1</cell><cell>Top-5</cell><cell>Top-1</cell><cell>Top-5</cell></row><row><cell>ELR+</cell><cell>IRNv2</cell><cell></cell><cell></cell><cell>77.78</cell><cell>91.68</cell><cell>70.29</cell><cell>89.76</cell></row><row><cell>DivideMix</cell><cell>IRNv2</cell><cell>MixUp</cell><cell>2</cell><cell>77.32</cell><cell>91.64</cell><cell>75.20</cell><cell>90.84</cell></row><row><cell>DivideMix</cell><cell>RN50</cell><cell></cell><cell></cell><cell>76.32</cell><cell>90.65</cell><cell>74.42</cell><cell>91.21</cell></row><row><cell>CE</cell><cell>RN50</cell><cell>ColorJitter</cell><cell>1</cell><cell>70.69</cell><cell>88.64</cell><cell>67.32</cell><cell>88.00</cell></row><row><cell>JS</cell><cell>RN50</cell><cell>ColorJitter</cell><cell>1</cell><cell>74.56</cell><cell>91.09</cell><cell>70.36</cell><cell>90.60</cell></row><row><cell>GJS</cell><cell>RN50</cell><cell>ColorJitter</cell><cell>1</cell><cell>77.99</cell><cell>90.62</cell><cell>74.33</cell><cell>90.33</cell></row><row><cell>NRD (ours)</cell><cell>RN50</cell><cell>ColorJitter</cell><cell>1</cell><cell>78.56</cell><cell>92.48</cell><cell>75.24</cell><cell>92.36</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc>Performance on clean CIFAR-10 and CIFAR-100 datasets. The values indicate test accuracy.</figDesc><table><row><cell>Method</cell><cell>CIFAR-10</cell><cell>CIFAR-100</cell></row><row><cell>CE</cell><cell>94.35</cell><cell>77.60</cell></row><row><cell>GCE</cell><cell>94.00</cell><cell>77.65</cell></row><row><cell>GJS</cell><cell>94.78</cell><cell>79.27</cell></row><row><cell>NRD (ours)</cell><cell>95.05</cell><cell>79.61</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgement This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00240379).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Detailed hyperparameter configurations A.1. CIFAR-10/100 benchmarks</head><p>General training details For the network architecture, we use PreActResNet-34 <ref type="bibr" target="#b42">[43]</ref>. For training, we use the SGD optimizer with momentum 0.9 and a batch size of 128, and train for 400 epochs. The learning rate is reduced by 1/10 at 50% and 75% of the training iterations. Augmentation policy For data augmentation, we use RandAugment <ref type="bibr" target="#b39">[40]</ref> with 𝑁 = 1, 𝑀 = 3, followed by random crop (size 32 with 4-pixel padding), random horizontal flip, and Cutout <ref type="bibr" target="#b40">[41]</ref> with length 5. Hyperparameters See Table <ref type="table">5</ref> for the details. For the baselines, we follow the same hyperparameter configurations used by <ref type="bibr" target="#b5">[6]</ref>. The 40% noise rate setting was used to find the best learning rates and weight decay rates. For the learning rates and weight decay rates for NRD, we used the same configurations as GJS. For the tuning of the hyperparameters {𝜋1, 𝜋2, 𝜋3} in the NRD loss, we fixed 𝜋1 = 𝜋3 so that the targets y and yt have equal weight. We tuned 𝜋2 ∈ {0.1, 0.2, ..., 0.9}.</p><p>For the moving average decay rate, we used 𝛽 = 0.99 for </p></div>
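The tuning procedure above (fix 𝜋1 = 𝜋3 and sweep 𝜋2 over {0.1, ..., 0.9}) can be enumerated as in the sketch below. It additionally assumes the weights are normalized to sum to one, which the text does not state explicitly; the function name is ours:

```python
def pi_candidates(step=0.1):
    """Enumerate (pi1, pi2, pi3) triples with pi1 == pi3, sweeping pi2."""
    triples = []
    for i in range(1, 10):                    # pi2 in {0.1, ..., 0.9}
        pi2 = round(i * step, 10)
        pi1 = round((1.0 - pi2) / 2.0, 10)    # assumed normalization to the simplex
        triples.append((pi1, pi2, pi1))
    return triples
```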
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. WebVision benchmark</head><p>General training details For the network architecture, we use ResNet-50 with random initialization. For training, we use SGD optimizer with momentum 0.9, a batch size of 64, and train for 300 epochs. The initial learning rate was set to 0.1 and reduced by 1/10 after the 100-th and 200-th epoch.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Augmentation policy</head><p>For data augmentation, we use random resized crop with size 224, random horizontal flip, and color jitter. We used the color jitter implementation from TorchVision <ref type="bibr" target="#b43">[44]</ref> with brightness=0.4, contrast=0.4, saturation=0.4, hue=0.2. For the NRD teacher augmentation, we use AugMix <ref type="bibr" target="#b41">[42]</ref> followed by random resized crop with size 224 and random horizontal flip. Hyperparameters For the hyperparameters {𝜋1, 𝜋2, 𝜋3} in the NRD loss, we used 𝜋1 = 𝜋3 = 0.1 and 𝜋2 = 0.8. The moving average decay rate was set to 𝛽 = 0.99.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Training dynamics visualization of perturbed inputs</head><p>In this section, we provide the visualized trajectory of the model prediction on the perturbed inputs throughout training (see Table <ref type="table">6</ref>). These are the same plots presented in Figure <ref type="figure">2a</ref>, albeit at different mid-training epochs. The model is trained using a standard training scheme with the cross-entropy loss on the NoisyCIFAR-10-symm-40% dataset. We observe that a significant portion of the predictions perturbed using an augmentation unseen at training (AutoAugment) gradually settles to the ground-truth class, whereas the predictions perturbed using the same augmentation policy used at training (RandomCrop) eventually converge to the noisy target class. The result shows that predictions from the perturbation identical to the training augmentation (red markers) are non-noise-robust distillation targets, whereas the predictions from the unseen perturbation (blue markers) are noise-robust distillation targets.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Is one annotation enough?-a datacentric image classification benchmark for noisy and ambiguous label estimation</title>
		<author>
			<persName><forename type="first">L</forename><surname>Schmarje</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Grossmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zelenka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dippel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kiko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Oszust</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pastell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Stracke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Valros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Volkmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Relabeling imagenet: from single to multi-labels, from global to localized labels</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Oh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Heo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Choe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2340" to="2350" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Lehman</surname></persName>
		</author>
		<ptr target="https://charlielehman.github.io/post/visualizing-tempscaling/" />
		<title level="m">Visualizing softmax</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Learning from noisy labels with deep neural networks: A survey</title>
		<author>
			<persName><forename type="first">H</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-G</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Neural Networks and Learning Systems</title>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Robust loss functions under label noise for deep neural networks</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ghosh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">S</forename><surname>Sastry</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence</title>
				<meeting>the Thirty-First AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Generalized jensenshannon divergence loss for learning with noisy labels</title>
		<author>
			<persName><forename type="first">E</forename><surname>Englesson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Azizpour</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="30284" to="30297" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">NLNL: Negative learning for noisy labels</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision (ICCV)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Joint negative and positive learning for noisy labels</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Shon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="9442" to="9451" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A closer look at memorization in deep networks</title>
		<author>
			<persName><forename type="first">D</forename><surname>Arpit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jastrzębski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ballas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Kanwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Maharaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fischer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lacoste-Julien</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 34th International Conference on Machine Learning</title>
				<meeting>the 34th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">70</biblScope>
			<biblScope unit="page" from="233" to="242" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Early-learning regularization prevents memorization of noisy labels</title>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Niles-Weed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Razavian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Fernandez-Granda</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="20331" to="20342" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Understanding and improving early stopping for learning with noisy labels</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Niu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="24392" to="24403" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Robust training under label noise by over-parameterization</title>
		<author>
			<persName><forename type="first">S</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Qu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>You</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 39th International Conference on Machine Learning</title>
				<meeting>the 39th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">162</biblScope>
			<biblScope unit="page" from="14153" to="14172" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Co-teaching: Robust training of deep neural networks with extremely noisy labels</title>
		<author>
			<persName><forename type="first">B</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Niu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Tsang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sugiyama</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">31</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">DivideMix: Learning with noisy labels as semi-supervised learning</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">C</forename><surname>Hoi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Neighborhood collective estimation for noisy label identification and correction</title>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Computer Vision</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Centrality and consistency: Two-stage clean samples identification for learning with instance-dependent noisy labels</title>
		<author>
			<persName><forename type="first">G</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Computer Vision</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">13685</biblScope>
			<biblScope unit="page" from="21" to="37" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Temporal ensembling for semi-supervised learning</title>
		<author>
			<persName><forename type="first">S</forename><surname>Laine</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Aila</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=BJ6oOfqge" />
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results</title>
		<author>
			<persName><forename type="first">A</forename><surname>Tarvainen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Valpola</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">FixMatch: Simplifying semi-supervised learning with consistency and confidence</title>
		<author>
			<persName><forename type="first">K</forename><surname>Sohn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Berthelot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Carlini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Raffel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">D</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kurakin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-L</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="596" to="608" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Adversarial training methods for semi-supervised text classification</title>
		<author>
			<persName><forename type="first">T</forename><surname>Miyato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">MixMatch: A holistic approach to semi-supervised learning</title>
		<author>
			<persName><forename type="first">D</forename><surname>Berthelot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Carlini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Papernot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Oliver</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Raffel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">mixup: Beyond empirical risk minimization</title>
		<author>
			<persName><forename type="first">H</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cisse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Dauphin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lopez-Paz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">On calibration of modern neural networks</title>
		<author>
			<persName><forename type="first">C</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pleiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<imprint>
			<publisher>PMLR</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1321" to="1330" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">On mixup training: Improved calibration and predictive uncertainty for deep neural networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Thulasidasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chennupati</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Bilmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Bhattacharya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Michalak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1503.02531</idno>
		<title level="m">Distilling the knowledge in a neural network</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Vicinal risk minimization</title>
		<author>
			<persName><forename type="first">O</forename><surname>Chapelle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vapnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">CutMix: Regularization strategy to train strong classifiers with localizable features</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Oh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Choe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yoo</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision (ICCV)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="6023" to="6032" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">AutoAugment: Learning augmentation strategies from data</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">D</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vasudevan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">Learning multiple layers of features from tiny images</title>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
	<note type="report_type">Technical Report</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Making deep neural networks robust to label noise: a loss correction approach</title>
		<author>
			<persName><forename type="first">G</forename><surname>Patrini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rozza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Menon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nock</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Qu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Agustsson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Van Gool</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1708.02862</idno>
		<title level="m">Webvision database: Visual learning and understanding from web data</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Understanding and utilizing deep neural networks trained with noisy labels</title>
		<author>
			<persName><forename type="first">P</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">B</forename><surname>Liao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1062" to="1070" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">ImageNet: A large-scale hierarchical image database</title>
		<author>
			<persName><forename type="first">J</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L.-J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fei-Fei</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</title>
				<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="248" to="255" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Reed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Anguelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Erhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rabinovich</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6596</idno>
		<title level="m">Training deep neural networks on noisy labels with bootstrapping</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Does label smoothing mitigate label noise?</title>
		<author>
			<persName><forename type="first">M</forename><surname>Lukasik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhojanapalli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Menon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kumar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Machine Learning</title>
				<meeting><address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="6448" to="6458" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Symmetric cross entropy for robust learning with noisy labels</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Luo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bailey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</title>
				<meeting>the IEEE/CVF International Conference on Computer Vision (ICCV)</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Generalized cross entropy loss for training deep neural networks with noisy labels</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sabuncu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Normalized loss functions for deep learning with noisy labels</title>
		<author>
			<persName><forename type="first">X</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Romano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Erfani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bailey</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 37th International Conference on Machine Learning</title>
				<meeting>the 37th International Conference on Machine Learning<address><addrLine>PMLR</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">119</biblScope>
			<biblScope unit="page" from="6543" to="6553" />
		</imprint>
	</monogr>
	<note>Proceedings of Machine Learning Research</note>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">Deep residual learning for image recognition</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">CVPR</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">RandAugment: Practical automated data augmentation with a reduced search space</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">D</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3008" to="3017" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<title level="m" type="main">Improved regularization of convolutional neural networks with cutout</title>
		<author>
			<persName><forename type="first">T</forename><surname>Devries</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">W</forename><surname>Taylor</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1708.04552</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">AugMix: A simple data processing method to improve robustness and uncertainty</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Mu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">D</forename><surname>Cubuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gilmer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Lakshminarayanan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Learning Representations (ICLR)</title>
				<meeting>the International Conference on Learning Representations (ICLR)</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<analytic>
		<title level="a" type="main">Identity mappings in deep residual networks</title>
		<author>
			<persName><forename type="first">K</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Computer Vision-ECCV 2016: 14th European Conference</title>
				<meeting><address><addrLine>Amsterdam, The Netherlands</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2016">October 11-14, 2016</date>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="630" to="645" />
		</imprint>
	</monogr>
	<note>Proceedings, Part IV</note>
</biblStruct>

<biblStruct xml:id="b43">
	<monogr>
		<author>
			<persName><surname>TorchVision maintainers and contributors</surname></persName>
		</author>
		<ptr target="https://github.com/pytorch/vision" />
		<title level="m">Torchvision: PyTorch&apos;s computer vision library</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
