<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">k-Most-Influential Neighbours: Noise-Resistant kNN</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Matt</forename><surname>Mallen</surname></persName>
							<email>mattmallen@protonmail.com</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science &amp; Information Technology</orgName>
								<orgName type="institution">University College Cork</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Derek</forename><surname>Bridge</surname></persName>
							<email>d.bridge@cs.ucc.ie</email>
							<affiliation key="aff0">
								<orgName type="department">School of Computer Science &amp; Information Technology</orgName>
								<orgName type="institution">University College Cork</orgName>
								<address>
									<country key="IE">Ireland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">k-Most-Influential Neighbours: Noise-Resistant kNN</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">FFFC275D7E1E658D1021721A2DAB7A03</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T18:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>k-nearest neighbours</term>
					<term>edited nearest neighbours</term>
					<term>instance selection</term>
					<term>noisy data</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The k-Nearest Neighbours (kNN) algorithm is a fundamental machine learning method known for its simplicity and effectiveness. However, it is notably sensitive to noisy data, which can significantly degrade its performance. We introduce k-Most-Influential Neighbours (kMIN), a novel variant designed to enhance noise resistance while adhering to the lazy learning paradigm of kNN. kMIN integrates reliability and similarity measures to obtain what we call influence. kMIN uses influence to identify neighbours and to weight their contributions when making predictions, mitigating the impact of unreliable training examples. We evaluate kMIN against traditional kNN, Weighted-kNN, and dataset editing algorithms for noise-filtering across multiple classification and regression datasets with varying noise levels. The results indicate that kMIN shows promise in noisy datasets, outperforming existing methods in several cases, but with further testing required to solidify these findings.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The k-Nearest Neighbours (kNN) algorithm, first developed by Fix and Hodges <ref type="bibr" target="#b0">[1]</ref>, is a staple in machine learning for both classification and regression tasks due to its simplicity and intuitive approach. It operates on the premise that similar problems have similar solutions; that is to say, it assumes that the target value or class of a query example can be predicted by aggregating the values/classes of the training examples that are closest in distance to it. Despite its advantages, kNN is notably sensitive to noisy (erroneous) or unrepresentative examples (outliers or atypical cases) in the training set <ref type="bibr" target="#b1">[2]</ref>. Noisy data can significantly skew the distance calculations and, consequently, the predictions. This sensitivity hampers kNN's performance in real-world applications where data imperfections are common. Methods that remove noise or that enhance kNN's resistance to noise are crucial for improving its applicability to practical scenarios, especially in safety-critical domains such as disease prediction <ref type="bibr" target="#b2">[3]</ref>.</p><p>There exist several dataset editing methods that attempt to identify and delete noisy examples from training sets, in preparation for use of kNN (see Section 2.1). However, they have limitations (Section 2.2), notably a focus on classification to the neglect of regression.</p><p>Instead, we seek to develop a variant of kNN that enhances kNN's resistance to noise. Robustness to noise ensures more reliable predictions, increases the algorithm's utility in diverse domains, and reduces the need for extensive dataset editing, which can be resource-intensive.</p><p>In this paper, we propose the k-Most-Influential Neighbours (kMIN) algorithm, a novel variant of kNN that incorporates measures of reliability and similarity into the neighbour selection and aggregation processes. 
kMIN aims to mitigate the influence of noisy training examples while departing as little as possible from kNN's lazy learning paradigm. Our key contributions include:</p><p>• Introducing a reliability metric based on an example's contribution to accurate predictions.</p><p>• Integrating reliability with similarity to adjust neighbour influence dynamically.</p><p>• Evaluating kMIN's performance against traditional kNN, Weighted-kNN, and kNN on datasets that have been cleaned using dataset editing methods, across multiple datasets with varying noise levels.</p><p>The rest of this paper proceeds as follows:</p><p>• Section 2, Related Work: We review prior research on kNN and its variants, including Weighted-kNN; we present dataset editing algorithms for classification (in particular, RENN and BBNR); and we show how to extend dataset editing algorithms to regression tasks.</p><p>• Section 3, kMIN Algorithm: We introduce the k-Most-Influential Neighbours (kMIN) algorithm, defining reliability, similarity, and influence, and explain their roles in fetching and aggregating neighbours.</p><p>• Section 4, Evaluation: We describe the datasets, experimental setup, and statistical methods used to compare kMIN with kNN, Weighted-kNN, and dataset editing algorithms on datasets with and without injection of artificial noise.</p><p>• Section 5, Results: We present the experimental results, highlighting kMIN's performance in handling noise and comparing it to existing methods in both classification and regression tasks.</p><p>• Section 6, Conclusions, Limitations, and Future Work: We conclude with key findings, discuss limitations, and outline future research directions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>The standard kNN algorithm predicts the target value or class of a query example by identifying the k closest examples in the training set based on a distance metric, commonly Euclidean distance <ref type="bibr" target="#b0">[1]</ref>. For regression tasks, the prediction will be the mean of the neighbours' target values; for classification tasks, the predicted class will be determined by a vote among the neighbours. In Weighted-kNN, which is a variant of the standard kNN algorithm, neighbours have weights, which are inversely proportional to their distance from the query example, giving closer neighbours more influence <ref type="bibr" target="#b3">[4]</ref>. The prediction is then either a weighted mean or weighted vote of the neighbours' target values/classes. kNN and its variants are lazy learners: generalization takes place at inference time, when given a query example for which a prediction is required. There is no training step, prior to inference. In eager learners, by contrast, a model that generalizes the training data is learned in a step that precedes use of the model for inference. Lazy learners have the disadvantage of tending to be slower at inference time, since most, if not all, of the computation has been deferred. But they have the advantage of being easily incremental. When new training examples arrive, they can be inserted into the training set and will have immediate effect on subsequent inference. For eager learners to incorporate new training examples requires that models be re-built, which, because it is expensive, is usually only done periodically.</p><p>Nevertheless, lazy learners may include pre-processing steps that come prior to inference. 
For kNN, for example, pre-processing might involve the following: construction from the training data of a data structure, such as a kd-tree <ref type="bibr" target="#b4">[5]</ref>, to speed-up inference-time retrieval of the k-nearest neighbours; or computation of all-pairs similarities between training examples for faster inference-time identification of the k nearest-neighbours by an algorithm such as Fish-and-Shrink <ref type="bibr" target="#b5">[6]</ref>; or feature-selection, to identify irrelevant or noisy features to exclude them from computation of distances, or feature-weighting to ascribe importance weights to features for use in the distance function <ref type="bibr" target="#b6">[7]</ref>. Pre-processing might also involve deleting examples from the training set, and we discuss methods for doing this in more detail in the remainder of this section.</p></div>
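<div xmlns="http://www.tei-c.org/ns/1.0"><p>The difference between standard kNN and Weighted-kNN described above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function names (k_nearest, knn_predict), the brute-force neighbour search, the small constant eps, and the tie-breaking behaviour are all assumptions made for illustration.</p>
<p>
```python
import math
from collections import defaultdict

def k_nearest(train, query, k):
    """Return the k (features, target) pairs closest to the query (brute force)."""
    return sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]

def knn_predict(train, query, k, weighted=False, task="classification", eps=1e-9):
    """Predict by mean or vote; with weighted=True, weights are 1 / distance."""
    neighbours = k_nearest(train, query, k)
    # eps is an assumed guard against division by zero for exact matches.
    weights = [1.0 / (math.dist(x, query) + eps) if weighted else 1.0
               for x, _ in neighbours]
    if task == "regression":
        total = sum(w * y for w, (_, y) in zip(weights, neighbours))
        return total / sum(weights)
    votes = defaultdict(float)
    for w, (_, y) in zip(weights, neighbours):
        votes[y] += w
    return max(votes, key=votes.get)

train = [((0.0,), "a"), ((1.0,), "b"), ((1.2,), "b"), ((5.0,), "a")]
plain = knn_predict(train, (0.1,), k=3)
by_distance = knn_predict(train, (0.1,), k=3, weighted=True)
```
</p><p>In this toy example the unweighted vote follows the two more distant neighbours, whereas the distance-weighted vote follows the single very close neighbour, which is the behaviour that distance weighting is intended to produce.</p></div>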
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Dataset Editing Algorithms for Classification</head><p>To address kNN's vulnerability to noise <ref type="bibr" target="#b1">[2]</ref>, various noise-reduction techniques have been developed. We refer to them as dataset editing algorithms. They clean datasets in a pre-processing step, prior to using kNN for inference. It is important to distinguish between two main goals of dataset editing algorithms: removing noise and reducing redundancy. Redundancy reduction is about discarding surplus data to improve efficiency. Our focus in this paper is with noise reduction. We will describe two such algorithms for classification tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1.">Repeated Edited Nearest Neighbours (RENN)</head><p>One editing algorithm is Repeated Edited Nearest Neighbours (RENN), as first proposed by Tomek <ref type="bibr" target="#b7">[8]</ref>. RENN operates by repeatedly applying Wilson's Edited Nearest Neighbour (ENN) rule <ref type="bibr" target="#b8">[9]</ref>. The ENN rule involves examining each example in the training set and removing it if its class label does not agree with the majority class of its k-nearest neighbours. Unlike the basic ENN, which is applied once, RENN iteratively uses the ENN rule until the algorithm cannot remove any more examples. Repeated application of the ENN rule allows for deeper cleaning, ensuring that only the most representative and reliable examples remain.</p></div>
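<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the ENN rule and its repeated application, the following sketch may help. It assumes Euclidean distance, a brute-force neighbour search, and that ties in the majority vote are broken in favour of the label seen first; none of these details is prescribed by the description above, and the names enn_pass and renn are hypothetical.</p>
<p>
```python
import math
from collections import Counter

def enn_pass(examples, k):
    """One application of the ENN rule: keep an example only if its label
    matches the majority label of its k nearest neighbours (itself excluded)."""
    kept = []
    for i, (x, y) in enumerate(examples):
        others = [ex for j, ex in enumerate(examples) if j != i]
        neighbours = sorted(others, key=lambda ex: math.dist(ex[0], x))[:k]
        labels = Counter(label for _, label in neighbours)
        # An example with no neighbours at all is kept (assumed edge case).
        if not labels or labels.most_common(1)[0][0] == y:
            kept.append((x, y))
    return kept

def renn(examples, k):
    """Repeat the ENN rule until a pass removes no further examples."""
    current = list(examples)
    while True:
        edited = enn_pass(current, k)
        if len(edited) == len(current):
            return edited
        current = edited

data = [((0.0,), "a"), ((0.1,), "a"), ((0.2,), "a"), ((0.3,), "a"),
        ((0.15,), "b"),  # a mislabelled point inside the "a" cluster
        ((5.0,), "b"), ((5.1,), "b"), ((5.2,), "b")]
cleaned = renn(data, k=3)
```
</p><p>On this toy data the first pass removes only the mislabelled point, and the second pass removes nothing, so the loop stops.</p></div>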
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2.">Blame-Based Noise Reduction (BBNR)</head><p>Another editing algorithm is Blame-Based Noise Reduction (BBNR) as first proposed by Delany and Cunningham <ref type="bibr" target="#b9">[10]</ref>. We describe and use version 2 of the algorithm <ref type="bibr" target="#b10">[11]</ref>. BBNR operates on the principle of 'blaming' examples that are frequently involved in incorrect kNN predictions. While RENN focuses on which examples kNN has misclassified, BBNR instead highlights which examples are helping to misclassify their neighbours.</p><p>BBNR assesses the impact of each training example on the overall prediction accuracy by finding its Liability Set. The Liability Set of example x is the set of other training examples that are misclassified by their k-nearest neighbours and where x was one of the incorrect majority neighbours. BBNR then removes the most problematic examples (those with largest Liability Sets) one by one. However, removal is only tentative. After removing an example from the training set, it carries out a check to see whether removal of the example has resulted in previously correctly classified examples now having erroneous predictions. If removal of the example has resulted in incorrect predictions, then the example is reinserted into the training set. To assist with this, BBNR uses the concept of a Coverage Set. The Coverage Set of an example x is the set of other training examples that are correctly classified by their k-nearest neighbours and where x was one of the correct majority neighbours. When BBNR tentatively removes an example x, then the check that was mentioned above needs to be applied to members of x's Coverage Set. We will also use the notion of a Coverage Set in defining kMIN (Section 3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Limitations of Existing Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1.">Deviation from Lazy Learning</head><p>Pre-processing steps, like the ones introduced by editing algorithms such as BBNR and RENN, move away from the classic kNN paradigm. This may reduce some of the unique appeal of kNN that might lead a designer to opt for it in a given application context. In particular, the algorithm becomes less straightforwardly incremental. The editing algorithm must be re-run periodically, possibly even reinstating examples that were previously deleted. Our kMIN algorithm also departs from strict laziness, but it confines its departure to the minimal pre-calculation of reliability values without altering the training data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2.">Over-elimination Risk</head><p>Aggressive noise filtering can remove valuable training examples, leading to information loss and potential degradation in predictive performance, particularly since filtering is a binary decision to keep or delete an example. Indeed, the original version of BBNR was found to be too aggressive on many datasets, leading to the more conservative version 2 <ref type="bibr" target="#b10">[11]</ref>. The decision of whether to keep or delete examples is typically based on reliability, but estimating reliability well in neighbourhoods of mixed quality remains challenging <ref type="bibr" target="#b11">[12]</ref>.</p><p>Unlike dataset editing algorithms, kMIN does not make binary decisions. Instead, it integrates reliability into the influence calculation, adjusting the contribution of examples dynamically without removing them. This helps avoid the risk of over-elimination and retains potentially useful examples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.3.">Applicability Constraints</head><p>There has been a focus on editing algorithms for classification tasks, leaving editing algorithms for regression tasks relatively neglected. kMIN applies just as well for regression as it does for classification. To enable us to compare kMIN with RENN and BBNR on both classification tasks and regression tasks, we review the work on adapting RENN and BBNR to regression and propose our own variants in the next subsection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Dataset Editing Algorithms for Regression</head><p>Key to the operation of RENN and BBNR in classification tasks is knowing what it means for one training example to agree with others. In classification, agreement can be straightforwardly determined by whether the class labels of two examples match. Then, in RENN and BBNR, we are looking for examples whose class does not agree with the majority class of its k-nearest neighbours, with RENN focusing on the misclassified example and BBNR focusing on the neighbours that cause the misclassification.</p><p>For regression tasks, it is more challenging to define agreement. It would be much too strict to define it in terms of exactly equal target values. Instead, agreement will be when target values are close enough. But it is not clear how to draw the 'dividing line'.</p><p>Kordos &amp; Blachnik adapt the ENN rule to regression tasks: two target values agree if their absolute difference is less than or equal to some threshold <ref type="bibr" target="#b12">[13]</ref>. More specifically, they consider the absolute difference between an example's target value and the value predicted by its neighbours, which is the mean of the neighbours' target values. For the threshold θ, they use the standard deviation of the neighbours' target values, multiplied by some hyperparameter α. Martin et al. apply this definition of agreement to give versions of RENN, BBNR and a number of other dataset editing algorithms suitable for regression tasks <ref type="bibr" target="#b13">[14]</ref>.</p><p>In our experiments (Section 4), we wanted to include dataset editing algorithms for the regression datasets as well as the classification datasets. However, we use our own modified definition of agreement for regression tasks, which differs slightly from the one in <ref type="bibr" target="#b12">[13]</ref>. 
Where Kordos &amp; Blachnik compare an example's target value with the value predicted by its neighbours (the mean of their target values), we carry out an element-wise check: an example agrees with a neighbour if the absolute difference between their target values is within a certain threshold. For overall agreement, we retain the need for majority agreement, as specified in the original RENN and BBNR algorithms.</p><p>In addition, we also have an alternative way of defining the threshold. Where Kordos &amp; Blachnik use a threshold that is proportional (by an amount α) to the standard deviation of the neighbours' target values, we also consider a threshold that is proportional to the standard deviation of the target values of the whole training set. The choice between these two ways of calculating the threshold is a matter for model selection, using the validation sets (Section 4).</p><p>Our element-wise definition of agreement adheres more closely to the original ENN and dataset editing algorithms, such as RENN and BBNR, which rely on individual comparisons rather than aggregate statistics. We hypothesise that this fine-grained approach allows for a more precise identification of noisy instances in regression tasks. It might be interesting to conduct an empirical comparison between the two ways of computing agreement in regression tasks. We leave this to future work, since differences are likely to be small and the focus of this paper is on comparing dataset editing with our kMIN approach. In the experiments reported in Section 4, results for dataset editing on regression datasets use our element-wise approach.</p></div>
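<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two notions of agreement for regression can be contrasted in a short sketch. The function names are hypothetical, and the use of the population standard deviation (statistics.pstdev) is an assumption, since the text does not say which estimator is used.</p>
<p>
```python
import statistics

def threshold(alpha, targets):
    """theta = alpha * standard deviation of the given targets. Pass either the
    neighbours' targets or the whole training set's targets, depending on
    which of the two threshold definitions is being tried."""
    return alpha * statistics.pstdev(targets)

def agrees_with_mean(target, neighbour_targets, theta):
    """Mean-based check: compare the target with the neighbours' mean."""
    return theta >= abs(target - statistics.mean(neighbour_targets))

def agrees_elementwise(target, neighbour_targets, theta):
    """Element-wise check: a majority of neighbours must each lie within theta."""
    close = sum(1 for t in neighbour_targets if theta >= abs(target - t))
    return close > len(neighbour_targets) / 2

neighbour_targets = [10.2, 9.9, 30.0]
elementwise = agrees_elementwise(10.0, neighbour_targets, theta=1.0)
mean_based = agrees_with_mean(10.0, neighbour_targets, theta=1.0)
```
</p><p>With one outlying neighbour, the element-wise check still finds majority agreement (two of three neighbours are within the threshold), whereas the mean-based check does not, because the outlier pulls the mean away from the example's target value.</p></div>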
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">k-Most-Influential Neighbours (kMIN)</head><p>The k-Most-Influential Neighbours (kMIN) algorithm's guiding principle is that an example's neighbours should not contribute equally when predicting the example's target value or class. Specifically, it posits that neighbours that are more similar to the example and that have higher reliability should significantly influence the outcome -with the balance between reliability and similarity being a tunable hyperparameter.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Reliability, Similarity and Influence</head><p>When defining reliability, we take inspiration from the concept of a Coverage Set, which we explained in the context of BBNR earlier (Section 2.1.2). For kMIN, the reliability r of an example x is the size of its Coverage Set. So, if, using standard kNN, a training example helps to predict ten other training examples correctly, it has a reliability of 10. Here, when we say "predict correctly", we are referring to agreement (Section 2.3): where the examples share the same class for classification tasks, and where the absolute difference between the examples' target values is less than or equal to threshold θ for regression.</p><p>When defining similarity, we use the inverse of Euclidean Distance (d): s(x, x′) = 1/(d(x, x′) + ϵ). We add a small constant, ϵ, to prevent division-by-zero. The influence of an example x′ on an example x is a combination of the similarity between the two and the reliability of x′:</p><formula xml:id="formula_0">i(x, x′) = (λ × s(x, x′)) + ((1 − λ) × r(x′))<label>(1)</label></formula><p>Here it becomes clear why we need to use similarity rather than distance. Distance would be 'at odds' with reliability: high influence comes with low distance (high similarity) and high reliability.</p><p>In fact, there remains a problem. Tuning hyperparameter λ effectively requires that similarity s and reliability r have values that are comparable with each other. 
After some preliminary experimentation with several methods, we chose to standardize their values: we compute the mean and standard deviation of each from the training set, and then subtract this mean and divide by the standard deviation before using values in Equation <ref type="formula" target="#formula_0">1</ref>.</p><p>Having defined influence, there are now three ways that kMIN might use it: it can use it to fetch neighbours; it can use it when aggregating target values/classes of neighbours; or it can use it for both.</p></div>
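<div xmlns="http://www.tei-c.org/ns/1.0"><p>The definitions of reliability and influence can be sketched as follows for classification. This is an illustrative reading, not the authors' code: the names reliabilities, standardise and influence are hypothetical, the neighbour search is brute force, ties are broken by order of appearance, and the population standard deviation is assumed for standardisation.</p>
<p>
```python
import math
import statistics
from collections import Counter

def reliabilities(train, k):
    """Coverage-set size of every training example: how many other examples
    it helps to classify correctly as one of their agreeing k neighbours."""
    counts = [0] * len(train)
    for i, (x, y) in enumerate(train):
        others = [j for j in range(len(train)) if j != i]
        nbrs = sorted(others, key=lambda j: math.dist(train[j][0], x))[:k]
        if not nbrs:
            continue
        majority = Counter(train[j][1] for j in nbrs).most_common(1)[0][0]
        if majority == y:  # example i is correctly classified by its neighbours
            for j in nbrs:
                if train[j][1] == y:  # neighbour j voted for the correct class
                    counts[j] += 1
    return counts

def standardise(values):
    """Subtract the mean and divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values) or 1.0  # assumed guard for zero spread
    return [(v - mu) / sigma for v in values]

def influence(similarity_z, reliability_z, lam):
    """Equation 1 applied to standardised similarity and reliability."""
    return lam * similarity_z + (1.0 - lam) * reliability_z

train = [((0.0,), "a"), ((1.0,), "a"), ((2.0,), "a"), ((1.5,), "b")]
raw = reliabilities(train, k=3)
z = standardise(raw)
```
</p><p>In this toy training set the lone "b" example sits inside a cluster of "a" examples; it never helps to classify another example correctly, so it receives the lowest reliability.</p></div>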
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1.">Using Influence to Fetch</head><p>For a given query example x, kMIN can retrieve the k training examples with highest influence over x. If λ is 0, then only reliability is considered; if λ is 1, then, from the point of view of fetching neighbours, kMIN behaves like normal kNN. Figure <ref type="figure" target="#fig_1">1</ref> helps visualise kMIN neighbour selection with a toy example (for k = 3). In Figure <ref type="figure" target="#fig_1">1a</ref>, with λ = 1, the algorithm selects neighbours purely based on distance, choosing nearby points, which, in the example, mostly happen to have low reliability scores and incorrectly classifying the query example as Class 1. In Figure <ref type="figure" target="#fig_1">1b</ref>, with λ = 0.31, the algorithm balances distance with reliability, selecting more distant but highly reliable points (reliability scores of 3 or more) and correctly classifying the query example as Class 0. This demonstrates how kMIN can overcome local noise by incorporating reliability into neighbour selection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2.">Using Influence to Aggregate</head><p>Alternatively, kMIN can use influence when aggregating the target values/classes of the k-nearest neighbours. This is like Weighted-kNN. However, instead of calculating weights using the reciprocal of the distance, the weight of each neighbour is that neighbour's influence, as calculated with Equation 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.3.">Using Influence to Fetch and Aggregate</head><p>We can use influence in both steps: kMIN can fetch the k most influential training examples and then also weight each neighbour's contribution to the final prediction by its influence. In this case, there is no reason to believe that the optimal value for hyperparameter λ should be the same in the two steps. Hence, for influence-based fetching we use Equation <ref type="formula" target="#formula_0">1</ref>; and for weighted aggregation, we use the same formula but with λ replaced by a new hyperparameter λ′.</p><p>For a particular dataset, the decision about whether kMIN should use influence only for fetching neighbours (with hyperparameter λ), or only for aggregation (with hyperparameter λ′), or for both (with hyperparameters λ and λ′) is a matter for model selection, using the validation sets (Section 4).</p></div>
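<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three ways of using influence can be put together in one sketch for classification. Several points are assumptions rather than details given in the text: the function name kmin_classify and its parameters lam and lam_agg (standing for λ and λ′) are hypothetical; similarities are standardised over the query's similarities to all training examples; raw reliability values are supplied by the caller; and standardised influence is used directly as a vote weight, even though it can be negative.</p>
<p>
```python
import math
import statistics
from collections import defaultdict

def standardise(values):
    """Z-score a list of numbers; a zero spread falls back to 1.0."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values) or 1.0
    return [(v - mu) / sigma for v in values]

def kmin_classify(train, reliability, query, k, lam=None, lam_agg=None, eps=1e-9):
    """lam=None fetches by distance only (as kNN does); lam_agg=None gives an
    unweighted vote. Setting either one switches influence on for that step."""
    sims = standardise([1.0 / (math.dist(x, query) + eps) for x, _ in train])
    rels = standardise(reliability)

    def influence(i, weight):
        return weight * sims[i] + (1.0 - weight) * rels[i]

    fetch = 1.0 if lam is None else lam
    ranked = sorted(range(len(train)), key=lambda i: influence(i, fetch),
                    reverse=True)
    votes = defaultdict(float)
    for i in ranked[:k]:
        votes[train[i][1]] += 1.0 if lam_agg is None else influence(i, lam_agg)
    return max(votes, key=votes.get)

# Three nearby but unreliable examples of class 1, three distant reliable ones of class 0.
train = [((0.5,), 1), ((0.6,), 1), ((0.7,), 1),
         ((2.0,), 0), ((2.1,), 0), ((2.2,), 0)]
reliability = [0, 0, 0, 3, 3, 3]
by_distance = kmin_classify(train, reliability, (0.0,), k=3, lam=1.0)
by_reliability = kmin_classify(train, reliability, (0.0,), k=3, lam=0.0)
```
</p><p>With λ = 1 the sketch fetches the three nearest examples, as kNN would; with λ = 0 it fetches the three most reliable ones instead. Intermediate values of λ trade the two off, and how negative weights should be treated in the aggregation step is left open here.</p></div>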
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Evaluation</head><p>In this section, we present details of an empirical comparison that we have conducted, with the results being in the next section.<ref type="foot" target="#foot_0">1</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Predictors and Datasets</head><p>On classification datasets, we compare standard kNN, Weighted-kNN, RENN+kNN, BBNR+kNN, and our own kMIN. For regression datasets, we do the same, but using our modified versions of RENN and BBNR (Section 2.3).</p><p>We use six UCI Machine Learning Repository datasets<ref type="foot" target="#foot_1">2</ref> for classification, and six for regression. We delete examples that have missing values; we standardize numeric-valued features using means and standard deviations computed only from training data; and we one-hot encode categorical-valued features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Experimental Methodology</head><p>We evaluated all algorithms using nested 5-fold cross-validation. In nested cross-validation, each fold is used in turn as the test set, the remainder being the development set. But the development set is itself split into 5 folds, each of these being used in turn as the validation set, the remainder being the training set.</p><p>We use the validation sets for model selection and hyperparameter tuning. For classification, we must choose the value of k. We start with the square root of the size of the training set, and then repeatedly try values slightly above and below the current value until we home in on the best value. For kMIN, we decide whether to use influence for fetching neighbours, for aggregating neighbours' classes, or for both, and we must set λ and λ′ as appropriate (Section 3). We try values of λ and λ′ between 0 and 1 in increments of 0.05.</p><p>For regression, additionally, our modified versions of RENN and BBNR need values for θ, the agreement threshold. As explained in Section 2.3, θ is defined as α multiplied by the standard deviation of either the neighbours' target values or the training set target values. For α, we try integer values between 1 and 10 inclusive, the same values that were found to be effective in <ref type="bibr" target="#b12">[13]</ref>. Regression experiments are therefore 20 times more expensive than classification ones (ten values of α for each of the two threshold definitions). To make them more feasible, we added stochasticity to the selection of k and, in kMIN, we try values of λ and λ′ between 0 and 1 in increments of 0.1 instead of 0.05.</p></div>
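<div xmlns="http://www.tei-c.org/ns/1.0"><p>The nested cross-validation scheme can be sketched as a generator of index splits. The helper names are hypothetical, and the sketch neither shuffles nor stratifies the data, which a real experiment would normally do; it is meant only to show how test, validation and training sets relate.</p>
<p>
```python
def k_folds(indices, n_folds):
    """Split a list of indices into n_folds disjoint, roughly equal folds."""
    return [indices[i::n_folds] for i in range(n_folds)]

def nested_cv_splits(n_examples, outer=5, inner=5):
    """Yield (train, validation, test) index lists for nested cross-validation."""
    all_idx = list(range(n_examples))
    for test in k_folds(all_idx, outer):
        held_out = set(test)
        # Everything outside the test fold is the development set.
        development = [i for i in all_idx if i not in held_out]
        for validation in k_folds(development, inner):
            val_set = set(validation)
            train = [i for i in development if i not in val_set]
            yield train, validation, test

splits = list(nested_cv_splits(50))
```
</p><p>With 5 outer and 5 inner folds, every hyperparameter configuration is therefore fitted and validated 25 times, which is why the size of the search space matters so much for the cost of the experiments.</p></div>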
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Statistical Significance Testing</head><p>To evaluate algorithm performance, we use accuracy for classification and mean absolute error (MAE) for regression. We apply statistical significance testing at the fold level, comparing performance across 5 cross-validation folds. P-values in Tables <ref type="table" target="#tab_1">1, 2</ref>, 6 and 7 are derived using the Friedman test, which checks for significant differences between algorithms across folds. The Friedman test, a non-parametric alternative to repeated-measures ANOVA, is well suited to comparing multiple algorithms on heterogeneous datasets without assuming normality <ref type="bibr" target="#b14">[15]</ref>. A P-value below 0.05 indicates that at least one algorithm differs significantly, warranting post hoc analysis using the Nemenyi test to determine which specific pairs of algorithms differ. In Tables 3, 4 and 5, p-values indicate pairwise significance, with lower values (p &lt; 0.05) marking significant differences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Determining the Noise Level</head><p>To test noise resistance, we separately report results, first, where we do not inject artificial noise into the training sets, and, second, where we do inject artificial noise. For classification, we randomly alter the class label of a specified fraction of the training set to another class; in regression, we add Gaussian noise to the target value of a similarly specified subset of the training set.</p><p>The levels of noise were progressively increased. We consider noise levels of 10%, 20%, 30%, 40%, and 50%, commonly used in the literature <ref type="bibr" target="#b15">[16]</ref>. Evaluating each algorithm at every noise level would multiply the computational effort, given our large hyperparameter search space. Instead, for each dataset, we select the lowest noise level that results in a statistically significant performance degradation, according to a t-test and comparing against the baseline of no noise.</p></div>
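<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two kinds of noise injection can be sketched as follows. The function names are hypothetical, and the standard deviation of the Gaussian noise (sigma) is an assumed parameter, since its value is not stated above.</p>
<p>
```python
import random

def flip_labels(labels, fraction, classes, rng):
    """Change the label of a random fraction of examples to a different class."""
    noisy = list(labels)
    for i in rng.sample(range(len(noisy)), round(fraction * len(noisy))):
        noisy[i] = rng.choice([c for c in classes if c != noisy[i]])
    return noisy

def add_gaussian_noise(targets, fraction, sigma, rng):
    """Add zero-mean Gaussian noise to a random fraction of target values."""
    noisy = list(targets)
    for i in rng.sample(range(len(noisy)), round(fraction * len(noisy))):
        noisy[i] += rng.gauss(0.0, sigma)
    return noisy

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
labels = ["a"] * 10 + ["b"] * 10
noisy_labels = flip_labels(labels, 0.2, ["a", "b"], rng)
noisy_targets = add_gaussian_noise([1.0] * 10, 0.3, sigma=0.5, rng=rng)
```
</p><p>Because a flipped label is always drawn from the other classes, a noise level of 20% changes exactly 20% of the labels, rather than fewer as would happen if the replacement could coincide with the original label.</p></div>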
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Classification Results</head><p>Table <ref type="table" target="#tab_0">1</ref> shows the accuracy of the algorithms on the datasets without injection of artificial noise. kMIN is competitive throughout, and it is the best performing algorithm on the Car dataset.</p><p>If the P-value in Table <ref type="table" target="#tab_0">1</ref> is below 0.05, then at least one of the algorithms is significantly different from the others. This is only true for the Car dataset. To know which difference or differences are significant, we consult Table <ref type="table" target="#tab_2">3</ref>. In this table, each cell represents the p-value from the comparison of a pair of algorithms (hence symmetric) using Nemenyi tests. A p-value less than 0.05 indicates a statistically significant performance difference. From Table <ref type="table" target="#tab_0">1</ref>, we know that kNN and kMIN outperform RENN; and, from Table <ref type="table" target="#tab_2">3</ref>, we see that these differences are significant.</p><p>Table <ref type="table" target="#tab_1">2</ref> shows the accuracy of the algorithms on the datasets when there is injection of artificial noise. kMIN is now more than competitive: it is the top performer in four of the six datasets, namely Car, Heart, Votes and Zoo. However, of these four, there is no statistical significance (P ≥ 0.05) in Heart and Zoo. In fact, according to the Friedman test, there are statistically significant differences for Car, Iris and Wine. We investigate these further using pairwise Nemenyi tests. Table <ref type="table" target="#tab_3">4</ref> shows these for the Car dataset with artificial noise. From Table <ref type="table" target="#tab_1">2</ref>, we know that kNN and kMIN outperform BBNR; and, from Table <ref type="table" target="#tab_3">4</ref>, we see that these differences are statistically significant. 
We find exactly the same on the noisy Wine dataset: kNN and kMIN outperform BBNR (Table <ref type="table" target="#tab_1">2</ref>) and the differences are significant (Table <ref type="table" target="#tab_4">5</ref>). When it comes to the Iris dataset, despite the Friedman test indicating statistical significance (P = 0.0361), the Nemenyi test (which is more conservative than the Friedman test) did not reveal any pairwise statistically significant differences; for this reason, we do not include a table of Nemenyi test p-values for the noisy Iris dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Regression Results</head><p>In Table <ref type="table" target="#tab_5">6</ref>, we see how the algorithms compare on the regression datasets without artificial noise. Again kMIN is competitive, bearing in mind that low values for MAE are better, although it is never the best performing method. The Friedman test P-values in Table <ref type="table" target="#tab_5">6</ref> suggest that there is at least one significant difference for the Automobile, Auto MPG and Real Estate datasets (P &lt; 0.05). Pairwise Nemenyi tests reveal that: kNN is significantly better than RENN and BBNR for the Auto MPG dataset; kNN and Weighted-kNN are both significantly better than RENN and BBNR for the Automobile dataset; and on Real Estate, Weighted-kNN is significantly better than BBNR. In no case was kMIN significantly better or worse than the others. Given that kMIN is the focus of this paper and was not involved in any significant differences on these datasets, we have elected not to include the tables of Nemenyi p-values.</p><p>Table <ref type="table">7</ref> shows the MAE of the algorithms on the datasets when there is injection of artificial noise. kMIN is now more than competitive: it is the top performer in four of the six datasets, namely Automobile, Auto MPG, Liver and Student (M). However, in no case was there statistical significance according to the Friedman test (P ≥ 0.05 in all cases) and so there is no reason to show tables of Nemenyi p-values.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions, Limitations and Future Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Conclusions</head><p>Our empirical comparison reveals several key insights. In noise-free environments, kNN and Weighted-kNN consistently demonstrated robust performance across classification and regression tasks. The dataset editing algorithms, RENN and BBNR, did not consistently outperform the simpler kNN variants, irrespective of whether we injected noise or not.</p><p>kMIN showed promise in handling noise: when we introduced artificial noise, it was the top performer on four of the six datasets for classification and on four of the six datasets for regression. However, in only a few cases were the results statistically significant. The post hoc analysis revealed that kMIN was statistically significantly better than BBNR on two datasets.</p><p>Overall, the findings are not encouraging for the dataset editing algorithms that we have tried, at least on these datasets. On the other hand, the results do hint at kMIN's potential in noisy data conditions (especially given that it was the top performer in eight of the twelve experiments in which we injected artificial noise). However, given the lack of convincing statistical significance, further investigation is warranted to understand its advantages and fully optimise its performance. In particular, we need experiments on larger datasets, where it may be easier to establish statistical significance (Section 6.2.1).</p><p>As things stand, our findings show the importance of selecting the appropriate algorithm based on the specific context of use, particularly the presence of noise in the data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Limitations</head><p>This section summarises what we believe to be the limitations inherent to our approach and potential methods of overcoming them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.1.">Statistical Significance</head><p>Establishing statistical significance in our experimental results proved challenging, which limits the robustness of our findings. Key among the challenges was the scale and scope of the hyperparameter search space. To enhance the likelihood of obtaining statistically significant outcomes, we propose three strategies:</p><p>• Increased Sample Size: Expanding the number of datasets (especially if we can include larger datasets) or replicating experiments across more folds could enhance the statistical power of our analyses, making it easier to detect actual differences between algorithms. • Refined Hyperparameter Optimisation: Implementing more sophisticated hyperparameter search techniques, such as Bayesian optimisation, could lead to better-tuned models, thereby increasing the performance differential. As highlighted by Turner et al. <ref type="bibr" target="#b16">[17]</ref>, Bayesian optimisation for hyperparameter tuning can yield better-performing models than random search. • Alternative Statistical Tests: Employing more sensitive statistical tests or adjusting significance levels may uncover subtle but meaningful differences between models.</p><p>While potentially increasing the computational complexity or resource demands of the research, these approaches could substantially improve the clarity and robustness of our conclusions regarding the comparative performance of the algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.2.">Diversity of Error Estimates</head><p>A limitation of our research is the reliance on a single error metric (MAE for regression and accuracy for classification) to evaluate algorithm performance. This approach may only partially capture the nuanced behaviours and strengths of the various algorithms under different conditions. For instance, the root mean squared error (RMSE) could provide additional insight into regression performance by penalising large errors more heavily. Similarly, precision, recall, and the F1 score could offer a more detailed understanding of performance in classification tasks, especially on imbalanced datasets where accuracy may not fully reflect an algorithm's effectiveness.</p></div>
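<div xmlns="http://www.tei-c.org/ns/1.0"><p>The difference between MAE and RMSE can be made concrete with a small Python sketch. The data below are invented for illustration and do not come from our experiments: two sets of predictions have the same total absolute error, but in one the error is spread evenly and in the other it is concentrated in a single example.</p><p><code lang="python"><![CDATA[
```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large errors dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical targets and two hypothetical sets of predictions.
y_true = [10.0, 10.0, 10.0, 10.0]
evenly_wrong = [12.0, 12.0, 12.0, 12.0]  # four errors of size 2
one_outlier = [10.0, 10.0, 10.0, 18.0]   # a single error of size 8

same_mae = mae(y_true, evenly_wrong) == mae(y_true, one_outlier)
rmse_gap = rmse(y_true, one_outlier) - rmse(y_true, evenly_wrong)
```
]]></code></p><p>Both sets of predictions have the same MAE, whereas the RMSE of the predictions with one large error is twice that of the evenly spread errors. An algorithm that occasionally makes large errors would therefore be ranked differently by the two metrics.</p></div>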
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.3.">Different Kinds of Noise</head><p>Our approach to introducing artificial noise changed only the target values and class labels, which may not fully capture the algorithms' performance under other kinds of noise found in real-world data, including feature noise, outliers, systematic noise, and missing values, each of which may influence the algorithms differently. Moreover, the relatively clean and well-curated datasets that we took from the UCI Machine Learning Repository may provide only weak indications of performance on inherently noisy real-world datasets.</p></div>
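<div xmlns="http://www.tei-c.org/ns/1.0"><p>To illustrate the distinction between label noise and feature noise, the following Python sketch shows one possible way of injecting each. It is an assumption-laden illustration rather than the procedure used in our experiments: the function names, the uniform choice of a replacement class, and the Gaussian perturbation of features are all hypothetical choices.</p><p><code lang="python"><![CDATA[
```python
import random


def inject_label_noise(labels, rate, rng):
    """Replace the labels of a fraction `rate` of examples with a different class.

    Assumes at least two distinct classes; the replacement class is chosen
    uniformly at random (an assumption of this sketch).
    """
    classes = sorted(set(labels))
    n_flip = round(rate * len(labels))
    noisy = list(labels)
    for i in rng.sample(range(len(labels)), n_flip):
        alternatives = [c for c in classes if c != labels[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy


def inject_feature_noise(rows, rate, scale, rng):
    """Perturb each numeric feature value with probability `rate`.

    The perturbation is zero-mean Gaussian with standard deviation `scale`
    (again an assumption; other noise models are possible).
    """
    return [
        [v + rng.gauss(0.0, scale) if rng.random() < rate else v for v in row]
        for row in rows
    ]


rng = random.Random(0)  # fixed seed so that the corruption is repeatable
labels = ["a", "b", "c"] * 10
noisy_labels = inject_label_noise(labels, rate=0.3, rng=rng)
noisy_rows = inject_feature_noise([[1.0, 2.0], [3.0, 4.0]], rate=0.5, scale=0.1, rng=rng)
```
]]></code></p><p>Label noise leaves the feature vectors, and hence the neighbourhoods, untouched, whereas feature noise changes which examples are retrieved as neighbours in the first place. This is one reason why results under one kind of noise may not transfer to the other.</p></div>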
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Future Work</head><p>This section summarises potential future avenues of research that may merit exploration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.1.">Incremental Reliability Calculations</head><p>In dynamic environments, new training data is likely to become available over time. The new training examples would require reliability scores and, especially in cases of concept drift, it is likely that the new training data would invalidate some of the reliability scores previously calculated. Calculating reliability scores incrementally could help address these problems in a more computationally efficient manner. Rather than re-evaluating reliability scores for the entire dataset, the algorithm could pinpoint and update only the scores of directly affected examples, significantly enhancing efficiency.</p></div>
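<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of updating only the affected scores can be sketched in Python for the insertion of a new training example. The sketch rests on assumptions that go beyond this paper: it takes an example's reliability to be the size of its Coverage Set, it defines that set as the other training examples of the same class that have the example among their k nearest neighbours, and it uses Euclidean distance. The class and function names are hypothetical. Deletion of examples and concept drift are not handled.</p><p><code lang="python"><![CDATA[
```python
import bisect
import math


def batch_coverage(points, labels, k):
    """Coverage-set sizes recomputed from scratch for the whole training set."""
    coverage = [0] * len(points)
    for c, point in enumerate(points):
        ranked = sorted((math.dist(point, other), o)
                        for o, other in enumerate(points) if o != c)
        for _, o in ranked[:k]:
            if labels[o] == labels[c]:
                coverage[o] += 1
    return coverage


class IncrementalCoverage:
    """Maintains coverage-set sizes as training examples arrive one at a time."""

    def __init__(self, k):
        self.k = k
        self.points, self.labels, self.coverage = [], [], []
        self.neighbours = []  # neighbours[c]: sorted (distance, index) pairs, at most k

    def add(self, point, label):
        new = len(self.points)
        self.points.append(point)
        self.labels.append(label)
        self.coverage.append(0)
        own = []
        for c in range(new):
            d = math.dist(point, self.points[c])
            own.append((d, c))
            nbrs = self.neighbours[c]
            entry = (d, new)
            # Scores change only where the newcomer enters c's k-neighbourhood.
            if len(nbrs) < self.k or entry < nbrs[-1]:
                if len(nbrs) == self.k:
                    _, dropped = nbrs.pop()  # the displaced k-th neighbour
                    if self.labels[dropped] == self.labels[c]:
                        self.coverage[dropped] -= 1
                bisect.insort(nbrs, entry)
                if label == self.labels[c]:
                    self.coverage[new] += 1
        # The newcomer's own neighbours may gain it as a covered example.
        own.sort()
        own = own[:self.k]
        for _, o in own:
            if self.labels[o] == label:
                self.coverage[o] += 1
        self.neighbours.append(own)
```
]]></code></p><p>The distance from the new example to every stored example is still computed, so an insertion remains linear in the size of the training set; what is avoided is re-ranking every neighbourhood. Only the new example, its own neighbours, and any neighbours that it displaces have their scores changed.</p></div>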
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.2.">Evaluating Computational Efficiency</head><p>While we believe kMIN adheres more closely to the lazy learning model than the editing algorithms do, it still introduces a pre-processing step that is not present in standard kNN, viz. the calculation of reliability scores and the standardisation of similarity and reliability scores. It would be interesting to compare the cost of kMIN's pre-processing with the cost of the dataset editing algorithms, bearing in mind that the dataset editing algorithms, by deleting training examples, reduce inference costs.</p></div>
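<div xmlns="http://www.tei-c.org/ns/1.0"><p>To indicate where the extra work arises at query time, the following Python sketch selects neighbours by a score that mixes similarity and reliability. It is a simplified illustration under stated assumptions, not a reproduction of the kMIN definition: it assumes that both scores are standardised as z-scores over the candidate examples and that they are combined linearly with weight λ on similarity, so that λ = 1.00 reduces to ordinary similarity-based retrieval. The function names and the example scores are hypothetical.</p><p><code lang="python"><![CDATA[
```python
from statistics import mean, pstdev


def standardise(values):
    """z-scores of the values; all zeros if the values are constant."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def k_most_influential(similarities, reliabilities, k, lam):
    """Indices of the k candidates with the highest combined score.

    Assumed score: lam * z(similarity) + (1 - lam) * z(reliability).
    """
    z_sim = standardise(similarities)
    z_rel = standardise(reliabilities)
    influence = [lam * s + (1 - lam) * r for s, r in zip(z_sim, z_rel)]
    ranked = sorted(range(len(influence)), key=lambda i: influence[i], reverse=True)
    return ranked[:k]


# Hypothetical candidates: the most similar ones happen to be the least reliable.
similarities = [0.9, 0.8, 0.7, 0.6, 0.5]
reliabilities = [0, 1, 5, 6, 7]

by_similarity = k_most_influential(similarities, reliabilities, k=3, lam=1.0)
favouring_reliability = k_most_influential(similarities, reliabilities, k=3, lam=0.31)
```
]]></code></p><p>Under these assumptions, the reliability scores can be computed once in advance, while the standardisation of the similarity scores and the combined ranking are repeated for each query. Measuring this per-query overhead against the savings that editing algorithms obtain by shrinking the training set is the comparison proposed in this section.</p></div>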
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.3.">Other Definitions of Reliability</head><p>An intriguing area for future exploration involves enhancing the reliability scores used in kMIN. One potential avenue of research would be to use some inverse of the size of each example's Liability Set as the basis for these scores, as opposed to the size of the Coverage Set, or even some combination of the two. Another possibility would be to use a recurrence relation, akin to PageRank <ref type="bibr" target="#b17">[18]</ref>, as proposed in <ref type="bibr" target="#b18">[19]</ref>. This adaptation would assess an example's reliability in terms of the reliability of its neighbours.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>(a) λ = 1.00: Using only similarity (b) λ = 0.31: Favouring reliability over similarity</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Visualisation of how kMIN selects neighbours with different λ values (for k = 3). The query example has a black outline. Numbers inside points indicate reliability scores. Dashed lines show the selected neighbours.</figDesc><graphic coords="6,306.67,68.79,216.61,223.79" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Accuracy on Classification Datasets without Artificial Noise (The best performing algorithm for each dataset is highlighted in bold) (Datasets where at least one algorithm differs significantly from another (P &lt; 0.05) have underlined P-values)</figDesc><table><row><cell>Dataset</cell><cell>kNN</cell><cell>W.kNN</cell><cell>RENN</cell><cell>BBNR</cell><cell>kMIN</cell><cell>P-value</cell></row><row><cell>Car</cell><cell>0.9109</cell><cell>0.9057</cell><cell>0.8721</cell><cell>0.8947</cell><cell>0.9178</cell><cell>0.0015</cell></row><row><cell>Heart</cell><cell>0.8214</cell><cell>0.8214</cell><cell>0.8114</cell><cell>0.7742</cell><cell>0.8012</cell><cell>0.0551</cell></row><row><cell>Iris</cell><cell>0.9400</cell><cell>0.9667</cell><cell>0.9533</cell><cell>0.9600</cell><cell>0.9600</cell><cell>0.1024</cell></row><row><cell>Votes</cell><cell>0.9139</cell><cell>0.9139</cell><cell>0.9098</cell><cell>0.9315</cell><cell>0.9226</cell><cell>0.5878</cell></row><row><cell>Wine</cell><cell>0.9776</cell><cell>0.9832</cell><cell>0.9606</cell><cell>0.9778</cell><cell>0.9492</cell><cell>0.0708</cell></row><row><cell>Zoo</cell><cell>0.9105</cell><cell>0.9600</cell><cell>0.9105</cell><cell>0.9700</cell><cell>0.9405</cell><cell>0.1104</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Accuracy on Classification Datasets with Artificial Noise (The level of noise that we injected is shown in parentheses) (The best performing algorithm for each dataset is highlighted in bold) (Datasets where at least one algorithm differs significantly from another (P &lt; 0.05) have underlined P-values)</figDesc><table><row><cell>Dataset</cell><cell>kNN</cell><cell>W.kNN</cell><cell>RENN</cell><cell>BBNR</cell><cell>kMIN</cell><cell>P-value</cell></row><row><cell>Car (10%)</cell><cell>0.8941</cell><cell>0.8981</cell><cell>0.8646</cell><cell>0.8356</cell><cell>0.8999</cell><cell>0.0032</cell></row><row><cell>Heart (40%)</cell><cell>0.6673</cell><cell>0.6403</cell><cell>0.6866</cell><cell>0.6266</cell><cell>0.7473</cell><cell>0.4702</cell></row><row><cell>Iris (50%)</cell><cell>0.7467</cell><cell>0.6667</cell><cell>0.7733</cell><cell>0.5533</cell><cell>0.7667</cell><cell>0.0361</cell></row><row><cell>Votes (50%)</cell><cell>0.4735</cell><cell>0.5003</cell><cell>0.6463</cell><cell>0.5089</cell><cell>0.6812</cell><cell>0.1571</cell></row><row><cell>Wine (30%)</cell><cell>0.9270</cell><cell>0.9381</cell><cell>0.8937</cell><cell>0.7359</cell><cell>0.9379</cell><cell>0.0116</cell></row><row><cell>Zoo (50%)</cell><cell>0.7224</cell><cell>0.6638</cell><cell>0.6929</cell><cell>0.4838</cell><cell>0.7433</cell><cell>0.3240</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Pairwise p-values for the Nemenyi Post Hoc Test Results for the Car Dataset without Artificial Noise (Significant differences have underlined p-values)</figDesc><table><row><cell>Algorithm</cell><cell>kNN</cell><cell>W.kNN</cell><cell>RENN</cell><cell>BBNR</cell><cell>kMIN</cell></row><row><cell>kNN</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>W.kNN</cell><cell>0.7811</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>RENN</cell><cell>0.0120</cell><cell>0.2200</cell><cell></cell><cell></cell><cell></cell></row><row><cell>BBNR</cell><cell>0.2659</cell><cell>0.8946</cell><cell>0.7243</cell><cell></cell><cell></cell></row><row><cell>kMIN</cell><cell>0.9000</cell><cell>0.6108</cell><cell>0.0042</cell><cell>0.1447</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Pairwise p-values for the Nemenyi Post Hoc Test Results for the Car Dataset with Artificial Noise (Significant differences have underlined p-values)</figDesc><table><row><cell>Algorithm</cell><cell>kNN</cell><cell>W.kNN</cell><cell>RENN</cell><cell>BBNR</cell><cell>kMIN</cell></row><row><cell>kNN</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>W.kNN</cell><cell>0.9000</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>RENN</cell><cell>0.3172</cell><cell>0.0904</cell><cell></cell><cell></cell><cell></cell></row><row><cell>BBNR</cell><cell>0.0904</cell><cell>0.0166</cell><cell>0.9000</cell><cell></cell><cell></cell></row><row><cell>kMIN</cell><cell>0.9000</cell><cell>0.9000</cell><cell>0.1795</cell><cell>0.0409</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5</head><label>5</label><figDesc>Pairwise p-values for the Nemenyi Post Hoc Test Results for the Wine Dataset with Artificial Noise (Significant differences have underlined p-values)</figDesc><table><row><cell>Algorithm</cell><cell>kNN</cell><cell>W.kNN</cell><cell>RENN</cell><cell>BBNR</cell><cell>kMIN</cell></row><row><cell>kNN</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>W.kNN</cell><cell>0.9000</cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>RENN</cell><cell>0.8378</cell><cell>0.6676</cell><cell></cell><cell></cell><cell></cell></row><row><cell>BBNR</cell><cell>0.0705</cell><cell>0.0306</cell><cell>0.4971</cell><cell></cell><cell></cell></row><row><cell>kMIN</cell><cell>0.9000</cell><cell>0.9000</cell><cell>0.6676</cell><cell>0.0306</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6</head><label>6</label><figDesc>MAE on Regression Datasets without Artificial Noise (The best performing algorithm for each dataset is highlighted in bold) (Datasets where at least one algorithm differs significantly from another (P &lt; 0.05) have underlined P-values)</figDesc><table><row><cell>Dataset</cell><cell>kNN</cell><cell>W.kNN</cell><cell>RENN</cell><cell>BBNR</cell><cell>kMIN</cell><cell>P-value</cell></row><row><cell>Automobile</cell><cell>0.4657</cell><cell>0.4657</cell><cell>0.7356</cell><cell>0.7299</cell><cell>0.6601</cell><cell>0.0011</cell></row><row><cell>Auto MPG</cell><cell>0.1225</cell><cell>0.1225</cell><cell>0.1761</cell><cell>0.1817</cell><cell>0.1700</cell><cell>0.0014</cell></row><row><cell>Liver</cell><cell>2.4089</cell><cell>2.3297</cell><cell>2.3991</cell><cell>2.3588</cell><cell>2.3504</cell><cell>0.0527</cell></row><row><cell>Real Estate</cell><cell>5.7758</cell><cell>5.3359</cell><cell>5.6789</cell><cell>5.7766</cell><cell>5.7107</cell><cell>0.0250</cell></row><row><cell>Student (M)</cell><cell>3.2467</cell><cell>3.2612</cell><cell>3.2531</cell><cell>3.3563</cell><cell>3.2888</cell><cell>0.0652</cell></row><row><cell>Student (P)</cell><cell>2.0946</cell><cell>2.0816</cell><cell>2.0841</cell><cell>2.0855</cell><cell>2.0979</cell><cell>0.1882</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7</head><label>7</label><figDesc>MAE on Regression Datasets with Artificial Noise (The level of noise that we injected is shown in parentheses) (The best performing algorithm for each dataset is highlighted in bold) (Datasets where at least one algorithm differs significantly from another (P &lt; 0.05) have underlined P-values)</figDesc><table><row><cell>Dataset</cell><cell>kNN</cell><cell>W.kNN</cell><cell>RENN</cell><cell>BBNR</cell><cell>kMIN</cell><cell>P-value</cell></row><row><cell>Automobile (40%)</cell><cell>0.7809</cell><cell>0.7375</cell><cell>0.7577</cell><cell>0.7273</cell><cell>0.7163</cell><cell>0.5781</cell></row><row><cell>Auto MPG (10%)</cell><cell>0.2297</cell><cell>0.2297</cell><cell>0.2560</cell><cell>0.2575</cell><cell>0.2008</cell><cell>0.1249</cell></row><row><cell>Liver (50%)</cell><cell>2.3524</cell><cell>2.3547</cell><cell>2.4375</cell><cell>2.4298</cell><cell>2.3392</cell><cell>0.1933</cell></row><row><cell>Real Estate (50%)</cell><cell>6.1441</cell><cell>5.9541</cell><cell>6.2349</cell><cell>6.0926</cell><cell>6.0859</cell><cell>0.2598</cell></row><row><cell>Student (M,40%)</cell><cell>3.3037</cell><cell>3.3322</cell><cell>3.3578</cell><cell>3.3698</cell><cell>3.2702</cell><cell>0.1712</cell></row><row><cell>Student (P,10%)</cell><cell>2.1110</cell><cell>2.0986</cell><cell>2.1254</cell><cell>2.0858</cell><cell>2.1089</cell><cell>0.1424</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">For reproducibility, all source code is available on a public GitHub repository: https://github.com/mattmallencode/NcKNN.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://archive.ics.uci.edu/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work has been supported by a grant from Science Foundation Ireland (SFI) under Grant Number 12/RC/2289-P2, which is co-funded under the European Regional Development Fund. We are grateful to the anonymous reviewers for their comments.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Discriminatory analysis - nonparametric discrimination: Consistency properties</title>
		<author>
			<persName><forename type="first">E</forename><surname>Fix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Hodges</surname></persName>
		</author>
		<idno type="DOI">10.2307/1403797</idno>
	</analytic>
	<monogr>
		<title level="j">International Statistical Review</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page" from="238" to="247" />
			<date type="published" when="1989">1989</date>
		</imprint>
	</monogr>
	<note>Originally published in 1951 as Report Number 4, Project Number 21-49-004, USAF School of Aviation Medicine, Randolph Field</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Enhancing K-nearest neighbor algorithm: A comprehensive review and performance analysis of modifications</title>
		<author>
			<persName><forename type="first">R</forename><surname>Halder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Uddin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Uddin</surname></persName>
		</author>
		<idno type="DOI">10.1186/s40537-024-00973-y</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Big Data</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="1" to="25" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Comparative performance analysis of K-nearest neighbour (KNN) algorithm and its different variants for disease prediction</title>
		<author>
			<persName><forename type="first">S</forename><surname>Uddin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Haque</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Moni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Gide</surname></persName>
		</author>
		<idno type="DOI">10.1038/s41598-022-10358-x</idno>
	</analytic>
	<monogr>
		<title level="j">Scientific Reports</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">The distance-weighted k-nearest-neighbor rule</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">A</forename><surname>Dudani</surname></persName>
		</author>
		<idno type="DOI">10.1109/TSMC.1976.5408784</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Systems, Man, and Cybernetics</title>
		<imprint>
			<biblScope unit="volume">SMC-6</biblScope>
			<biblScope unit="page" from="325" to="327" />
			<date type="published" when="1976">1976</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">An algorithm for finding best matches in logarithmic expected time</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Friedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">L</forename><surname>Bentley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">A</forename><surname>Finkel</surname></persName>
		</author>
		<idno type="DOI">10.1145/355744.355745</idno>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Mathematical Software</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="209" to="226" />
			<date type="published" when="1977">1977</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Fish and shrink. A next step towards efficient case retrieval in large-scale case bases</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Schaaf</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Third European Workshop on Advances in Case-Based Reasoning</title>
				<meeting>the Third European Workshop on Advances in Case-Based Reasoning</meeting>
		<imprint>
			<publisher>Springer-Verlag</publisher>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="362" to="376" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Aha</surname></persName>
		</author>
		<idno type="DOI">10.1016/0020-7373(92)90018-G</idno>
	</analytic>
	<monogr>
		<title level="j">International Journal of Man-Machine Studies</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="267" to="287" />
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">An experiment with the edited nearest-neighbor rule</title>
		<author>
			<persName><forename type="first">I</forename><surname>Tomek</surname></persName>
		</author>
		<idno type="DOI">10.1109/TSMC.1976.4309523</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Systems, Man, and Cybernetics</title>
		<imprint>
			<biblScope unit="volume">SMC-6</biblScope>
			<biblScope unit="page" from="448" to="452" />
			<date type="published" when="1976">1976</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Asymptotic properties of nearest neighbor rules using edited data</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">L</forename><surname>Wilson</surname></persName>
		</author>
		<idno type="DOI">10.1109/TSMC.1972.4309137</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Systems, Man, and Cybernetics</title>
		<imprint>
			<biblScope unit="volume">SMC-2</biblScope>
			<biblScope unit="page" from="408" to="421" />
			<date type="published" when="1972">1972</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">An analysis of case-base editing in a spam filtering system</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Delany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cunningham</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-540-28631-8_11</idno>
	</analytic>
	<monogr>
		<title level="m">Procs. of the Seventh European Conference on Case-Based Reasoning</title>
				<meeting>the Seventh European Conference on Case-Based Reasoning</meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="128" to="141" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Blame-based noise reduction: An alternative perspective on noise reduction for lazy learning</title>
		<author>
			<persName><forename type="first">F.-X</forename><surname>Pasquier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-J</forename><surname>Delany</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cunningham</surname></persName>
		</author>
		<idno>TCD-CS-2005-29</idno>
		<imprint>
			<date type="published" when="2005">2005</date>
		</imprint>
	</monogr>
	<note type="report_type">Computer Science Technical Report</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Never judge a case by its (unreliable) neighbors: Estimating case reliability for CBR</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Parsodkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Deepak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chakraborti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Case-Based Reasoning Research and Development</title>
				<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="256" to="270" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Instance selection with neural networks for regression problems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Kordos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blachnik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Artificial Neural Networks and Machine Learning -ICANN 2012</title>
				<meeting><address><addrLine>Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="263" to="270" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">On the regressand noise problem: Model robustness and synergy with regression-adapted noise filters</title>
		<author>
			<persName><forename type="first">J</forename><surname>Martín</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Sáez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Corchado</surname></persName>
		</author>
		<idno type="DOI">10.1109/ACCESS.2021.3123151</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="145800" to="145816" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Statistical comparisons of classifiers over multiple data sets</title>
		<author>
			<persName><forename type="first">J</forename><surname>Demšar</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="1" to="30" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Noise in datasets: What are the impacts on classification performance?</title>
		<author>
			<persName><forename type="first">R</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>Chu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Pattern Recognition Applications and Methods</title>
				<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="163" to="170" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Bayesian optimization is superior to random search for machine learning hyperparameter tuning: Analysis of the black-box optimization challenge</title>
		<author>
			<persName><forename type="first">R</forename><surname>Turner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Eriksson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>McCourt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kiili</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Laaksonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">M</forename><surname>Guyon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Neural Information Processing Systems</title>
				<imprint>
			<date type="published" when="2020">2020. 2021</date>
			<biblScope unit="page" from="3" to="26" />
		</imprint>
	</monogr>
	<note>Conference Proceedings</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">The anatomy of a large-scale hypertextual web search engine</title>
		<author>
			<persName><forename type="first">S</forename><surname>Brin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Page</surname></persName>
		</author>
		<idno type="DOI">10.1016/S0169-7552(98)00110-X</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh International World Wide Web Conference</title>
				<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="107" to="117" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">The case for circularities in case-based reasoning</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">P</forename><surname>Parsodkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Deepak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chakraborti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Procs. of the 31st International Conference on Case-Based Reasoning</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Massie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Chakraborti</surname></persName>
		</editor>
		<meeting>the 31st International Conference on Case-Based Reasoning</meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="85" to="101" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
