=Paper=
{{Paper
|id=Vol-1353/paper_05
|storemode=property
|title=IPRed: Instance Reduction Algorithm Based on the Percentile of the Partitions
|pdfUrl=https://ceur-ws.org/Vol-1353/paper_05.pdf
|volume=Vol-1353
|dblpUrl=https://dblp.org/rec/conf/maics/TurkiW15
}}
==IPRed: Instance Reduction Algorithm Based on the Percentile of the Partitions==
IPRed: Instance Reduction Algorithm Based on the Percentile of the Partitions Turki Turki1,2 Zhi Wei2 1 2 Computer Science Department Department of Computer Science King Abdulaziz University New Jersey Institute of Technology P.O. Box 80221, Jeddah 21589, Saudi Arabia University Heights, Newark, NJ 07102 tturki@kau.edu.sa {ttt2,zhiwei}@njit.edu Abstract S according to criterion(Chou, Kuo, and Chang 2006). All the previously mentioned algorithms significantly reduce the Instance reduction methods are popular methods that re- storage requirement for the training set used by the classi- duce the size of the datasets to possibly improve the classification accuracy. We present a method that re- fication algorithms(Wilson and Martinez 2000; Chou, Kuo, duces the size of the dataset based on the percentile of and Chang 2006). However, the classification accuracy de- the dataset partitions which we call IPRed. We evalu- grades most of the time on the validation set compared to ate our and other popular instance reduction methods the classification accuracy of the classification algorithms from a classification perspective by 1-nearest neighbor using the original training set(Chou, Kuo, and Chang 2006; algorithm on many real datasets. Our experimental eval- Wilson and Martinez 2000). uation on the datasets shows that our method yields In this paper we consider approach that partitions the data- the minimum average error with statistical significance. set into training partitions Sk for k = 1...9 and a valida- Keywords: instance reduction, nearest neighbor, clas- tion partition V to compute the sum of minimum distances sification, statistical significance. between the validation partition V and each training parti- tion Sk . We then construct reduced training set Z by select- Introduction ing the training partitions that are less than the pth percentile of all the training partitions for specific p value. The goal is Instance reduction methods are popular methods used in to select the training partitions that contain similar instances machine learning and data mining that reduce the size of to the validation instances and remove those with dissim- the datasets to possibly improve the classification accur- ilar instances. The idea behind the strategy is that instances acy of the classification algorithms(Wilson and Martinez dissimilar with validation instances will have little, if not ad- 2000). Among many methods the condensed nearest neigh- verse, effect on predicting validation instances, and thus can bor(CNN)(Hart 1968) is a popular algorithm. CNN out- be removed with little, if not positive, effect on classifica- puts a reduced training set Z from the original training set tion. An example in real life is that a professor gives stu- S where Z classifies all instances in S correctly using 1- dents a take-home exam (i.e. validation set without labels) nearest neighbor (1NN)(Alpaydin 1997) . For given valida- and students will look for chapters (i.e. training partitions) 0 0 tion instance v = (x , y ), prediction is given by(Wu et al. of the book (i.e. training set) which contain similar inform- 2008) ation (i.e. instances of the training partitions) to the exam 0 0 questions (i.e. instances of the validation set) and skip irrel- y = arg max I(l = y1 ) (1) 0 l ∈L evant chapters. For given reduced training set Z the predic- 0 tion is given by using 1NN(Equation 1). Compared to other where l is the label, the set of all labels is denoted by popular instance reduction methods our experimental evalu- L, y1 is the label for the closest instance x obtained by se- ation on 30 datasets shows that our method IPRed yields the lecting the label for the instance x ∈ Z that gives the min- minimum average error with statistical significance. 0 imum distance from x , I(.) is an indicator function that out- The rest of this paper is organized as follows. In section II 0 puts 1 if its argument is true and 0 otherwise, and y ∈ L we review related work. In section III we present our method is the predicted label for the validation instance for which IPRed. Following that we present experimental evaluation 0 I(y = y1 ) attains its maximum. Reduced nearest neigh- and discussion before concluding. bor(RNN)(Gates 1972) is a modification to improve CNN which includes additional step that is removing instances Related Work from reduced training set Z that do not cause any misclassi- fication on the training set S(Bhatia and others 2010). Gen- Condensed Nearest Neighbor Rule eralized condensed nearest neighbor rule(GCNN) is another The CNN outlined in Algorithm 1 works as fol- variant of CNN which iteratively constructs reduced train- lows(Alpaydin 1997). CNN recieves training set S and ing set Z from training set S by selecting instances from empty subset Z(lines 1-2). At each iteration of the for loop of lines 6-13 instance x is selected randomly from the train- divide the dataset size k by 10 and take the floor of the res- ing set S(line 7). Lines 8-9 perform 1NN classification as ult. Lines 3-4 initialize the variables. The for loop of lines follows. In line 8 the instance zc is obtained by selecting the 5-18 creates 9 training partitions Pk for k = 1...9 by tak- instance zj in the stored subset Z that gives the minimum ing subsamples without replacement from dataset D where distance from x. Lines 9-10 store the instance x in the sub- the size of each Pk is the same. The for loop of lines 20-28 set Z if it is incorrectly classified. The do-while loop of lines creates the validation partition V1 by taking the remaining 4-14 terminates when the instances in the training set S are subsamples without replacement from the dataset D. Line correctly classified by the stored subset Z with 1NN. Line 29 assigns the 9 training partitions to S. In line 30 we assign 15 returns the reduced training set Z. For given Z the pre- the validation partition V1 to V . In line 31 our algorithm diction on the validation set V is given by 1NN(Equation outlined in Algorithm 3 is called by taking partitions in S 1). and V as inputs. IPRed algorithm outputs a training set Z of reduced size to guide the 1NN algorithm for better classi- Algorithm 1 Condensed Nearest Neighbor algorithm fication performance on the validation set V . The IPRed al- (CNN) gorithm which is outlined in Algorithm 3 works as follows. 1: CNN(S ,Z ) Line 2 stores the size of the validation partition V1 . The for 2: Z←∅ loop of lines 3-12 iterates 9 times to store the sum of the min- 3: P re add ← 1 imum euclidean distances(line 11) between the partitions as 4: do follows. At each iteration of the for loop of lines 5-10 we 5: add ← 0 store the computed euclidean distances(line 7) between the 6: for all instances in training set S do nth instance in V1 [n, j] and each instance i in the kth par- 7: Randomly select x from S tition Pk [i, j] for i = 1...p size and j = 1...d(for loop of 8: Find zc ∈ Z such that D(x, zc ) = minj D(x, zj ) lines 6-8). 9: if label(x) 6= label(zc ) then 10: Z ←Z∪x Algorithm 2 Dataset Partitioning algorithm (DP) 11: add ← 1 1: DP(D = {(x1 , y1 ), ..., (xk , yk )}) 12: k end if 2: size ← b 10 c 13: end for 3: b←1 14: while (P re add == add) 4: c ← size 15: return (Z) 5: for k = 1 to 9 do 6: q←1 7: r←1 Reduced Nearest Neighbor Rule 8: for i = b to size do The reduced nearest neighbor(RNN)(Gates 1972; Wilson 9: for j = 1 to d do and Martinez 2000) is a modification to CNN which starts 10: Pk [q, r] ← x[i, j] with training set S and reduced training set Z where all in- 11: r ←r+1 stances are copied from the training set S to the reduced 12: end for training set Z. The RNN algorithm iteratively removes each 13: Pk [q, r] ← y[i] instance from Z if the removal does not lead to any mis- 14: q ←q+1 classification of the other instances in S using the remaining 15: end for instances in Z. Prediction is performed on the validation set 16: b←b+c using Z. 17: size ← size + c 18: end for Generalized Condensed Nearest Neighbor Rule 19: q←1 The generalized condensed nearest neighbor(GCNN)(Chou, 20: for i = b to k do Kuo, and Chang 2006; Olvera-López et al. 2010) sequen- 21: r←1 tially adds instances from the training set S to the reduced 22: for j = 1 to d do training set Z when the instances are not absorbed by Z. In- 23: V1 [q, r] ← x[i, j] stance x ∈ S is absorbed when |x − a| − |x − b| > δ where 24: r ←r+1 a ∈ Z is the nearest instance to x with the same class la- 25: end for bel(i.e. label(x) = label(a)), b ∈ Z is the nearest instance to 26: V1 [q, r] ← y[i] x with different class label(i.e. label(x) 6= label(b)), and δ is 27: q ←q+1 the threshold. Prediction is made on the validation set using 28: end for Z. 29: S ← {P1 , P2 , ..., P9 } 30: V ← {V1 } IPRed 31: Z ← IP Red(S, V ) As shown in Algorithm 2 the DP algorithm receives as an input a dataset D ∈ Rk×d+1 of size k where the label as- We then store the minimum distance(line 9). After the for sociated to the instance xk ∈ Rd is yk (line 1). In line 2 we loop of lines 5-10 terminates, we store the sum of the min- . imum distances between the kth partition Pk and V1 (line Code Dataset Classes Dimensions Instances 11). Thus, the for loop of lines 3-12 iterates 9 times to store 1 Soy Bean 4 35 47 the results for the 9 training partitions(line 11). In line 13 2 Colon Cancer 2 2000 62 3 Leukemia 2 7129 72 we use the percentile as a robust measurement of the loc- 4 Breast Tissue 6 9 106 ation of the data to store the 75th percentile of all training 5 Appendicitis 2 7 106 partitions. Line 14 initializes j to 1. The for loop of lines 6 Iris 3 4 150 7 Hayes-Roth 3 4 160 15-20 selects the ith training partition that is less than 75th 8 Global Cancer Map 14 16063 190 percentile of all training partitions for i = 1...9. We store 9 Glass Identification 6 10 214 the index of the ith training partition that satisfies the con- 10 Heart 2 13 270 11 Haberman 2 3 306 dition(line 16) in P artitions index(line 17). After the for 12 Ecoli 8 6 336 loop of lines 22-31 terminates we return the reduced training 13 Liver Disorders 2 6 345 14 Bci 2 117 400 set Z(line 32) which contains the selected training partitions 15 Saheart 2 9 462 in P artitions index. Prediction on the validation set V is 16 Musk 2 166 476 given by 1NN(Equation 1) using Z. 17 Led 10 7 500 18 Climate 2 18 540 19 Breast Cancer 2 30 569 Algorithm 3 IPRed algorithm 20 Digits 2 63 762 21 Diabetes 2 8 768 1: IPR ED(S = {P1 , P2 , ..., P9 }, V = {V1 } ) 22 Vowel 11 13 990 23 Statlog German Credit Card 2 24 1000 2: v size ← V1 .size 24 Banknote Authentication 2 4 1372 3: for k = 1 to 9 do 25 Contraceptive 3 9 1473 4: p size ← Pk .size 26 Wine Quality Red 11 11 1599 27 Ozone 2 72 1847 5: for n = 1 to v size do 28 Steel Plates Faults 7 27 1941 6: for i = 1 to p size do 29 Insurance Company COIL 2000 2 85 5822 30 Magic Gamma Telescope 2 10 19020 s d P 7: distances[i] ← (Pk [i, j] − V1 [n, j])2 j=1 Table 1: Datasets from the UCI(A. Asuncion 2007), 8: end for KEEL(Alcalá et al. 2010), and Bioinformatics repositor- 9: nearest instances[n] ← M in(distances) ies(Jesús S. Aguilar-Ruiz ) that we used in our experimental 10: end for evaluation vP size 11: P artitions[k] ← nearest instances[t] t=1 12: end for methodology, then presents the experimental results. 13: Q4 ← P ercentile(P artitions, 75%) 14: j←1 Datasets 15: for i = 1 to length(P artitions) do We use 30 real datasets in our experiments for classification. 16: if P artitions[i] < Q4 then The information on datasets is tabulated in Table 1 which 17: P artitions index[j] ← i are ordered in terms of increasing number of instances. The 18: j ←j+1 datasets are obtained from three different sources. The ap- 19: end if pendicitis and saheart are obtained from KEEL-dataset re- 20: end for pository(Alcalá et al. 2010). The colon cancer, leukemia, 21: n←1 and global cancer map are obtained from BioInformatics 22: for m = 1 to length(P artitions index) do Research Group-dataset repository (Jesús S. Aguilar-Ruiz ). 23: k ← P artitions index[m] All the remaining datasets are from the UCI machine learn- 24: p size ← Pk .size ing repository(A. Asuncion 2007). 25: for i = 1 to p size do 26: for j = 1 to d + 1 do Experimental Methodology 27: Z[n, j] ← Pk [i, j] We evaluate four classification algorithms: IPRed+1NN, 28: end for CNN+1NN, RNN+1NN, and 1NN where the four al- 29: n←n+1 gorithms are 1-nearest neighbor applied to IPRed, 1-nearest 30: end for neighbor applied to condensed nearest neighbor, 1-nearest 31: end for neighbor applied to reduced nearest neighbor, and 1-nearest 32: return (Z) neighbor respectively. We use the 10-fold cross-validation on each dataset ensuring the same splits for each algorithm. For each instance reduction method we reduce the size k Experimental Evaluation on the original training set from S ∈ Rk×d+1 to Z ∈ 0 We experimentally evaluate the performance of our instance 0 Rk ×d+1 where k < k . We then apply 1NN for the reduction algorithm and compare it against others from a given Z on the validation set. Recall that 1NN is given by classification perspective(Japkowicz and Shah 2011) by 1- 0 y = argmax I(r = y1 )(Equation 1). We wrote our code in nearest neighbor(1NN) algorithm on 30 real datasets shown r in Table 1. This section describes the datasets, experimental R. Code Dataset IPRed+1NN % CNN+1NN % RNN+1NN % 1NN % 1 Soy Bean 2.5 64.4681 5 15.5319 5 14.6809 2.5 100 2 Colon Cancer 15.417 60.6452 18.75 35.4839 18.333 26.9355 15.417 100 3 Leukemia 10 60 16.8254 28.8889 20.794 24.5833 15.397 100 4 Breast Tissue 35.625 59.434 39.375 54.717 36.75 49.6226 41.75 100 5 Appendicitis 18.125 61.6981 21.125 33.3019 23.5 26.6038 18.5 100 6 Iris 4 60 6 12.1333 8 9.8667 4 100 7 Hayes-Roth 29.375 60 30 48.875 30.625 44.9375 29.375 100 8 Global Cancer Map 38.947 60 40 49.7368 41.579 44.7895 39.474 100 9 Glass Identification 0.8 60.5607 0.8762 7.2897 0.952 4.8131 0.876 100 10 Heart 39.259 60 41.1111 52.8519 41.852 47.4444 42.222 100 11 Haberman 29.444 59.4118 33.5556 47.4837 37.444 45.1961 29.111 100 12 Ecoli 22.867 60.1786 25.8042 37.5 27.063 33.869 23.823 100 13 Liver Disorders 34.472 60 41.9306 52.9275 39.578 47.3333 36.674 100 14 Bci 44 60 45.75 55 42.5 49.275 45.5 100 15 Saheart 42.88 59.9134 43.5507 53.5498 44.629 46.9048 43.551 100 16 Musk 15.905 60.1261 13.9904 27.563 15.054 21.9748 15.66841 100 17 Led 28.2 59 29 42.34 26.8 41.64 28.6 100 18 Climate 11.111 60 16.6667 24.1667 18.704 19.7963 11.111 100 19 Breast Cancer 9.316 60.1582 9.8764 14.9736 9.698 11.9332 9.008 100 20 Digits 0.128 60.0787 1.046 3.9895 1.704 3.1102 0.256 100 21 Diabetes 32.832 59.8958 34.8872 46.0026 35.207 39.3229 32.531 100 22 Vowel 36.263 60 37.7778 19.0202 40.101 17.5758 36.263 100 23 Statlog German Credit Card 34.5 60 34.6 49.38 36.7 47.39 33.4 100 24 Banknote Authentication 0.073 60.0437 0.219 1.7274 0.219 1.5306 0.073 100 25 Contraceptive 51.942 59.9593 53.9116 67.3999 54.314 67.1079 51.943 100 26 Wine Quality Red 54.552 60 56.7105 54.7154 57.697 49.7311 55.37 100 27 Ozone 11.323 60.0758 15.9808 22.1765 18.195 17.8073 11.162 100 28 Steel Plates Faults 61.619 60 62.443 65.4044 62.907 61.6383 61.722 100 29 Insurance Company COIL 2000 10.821 59.9966 13.1751 23.0831 13.243 22.6984 10.564 100 30 Magic Gamma Telescope 14.732 60 27.6656 35.469 28.465 30.643 24.022 100 Average 24.701 60.1881 27.253 36.0894 27.92 32.3585 25.662 100 Table 2: Average cross-validation error and average storage percentage % of the algorithms on each of the 30 real datasets from the UCI(A. Asuncion 2007), KEEL(Alcalá et al. 2010), and Bioinformatics repositories(Jesús S. Aguilar-Ruiz ). The algorithm with the minimum error is shown in bold. CNN+1NN RNN+1NN 1NN IPRed+1NN 0.000007 0.00001 0.02088 CNN+1NN 0.0226 0.00018 RNN+1NN 0.00044 Table 3: P-values of Wilcox rank test(two-tailed test) between all pairs of algorithms. Experimental Results on Thirty Datasets higher average error of 27.253% and has the minimum er- For each training-validation split in our classifications tasks ror in 1 out of 30 datasets. RNN+1NN gives an average er- we measure the error on the validation split as the number ror of 27.92% and has the minimum error in 2 out of 30 of instances incorrectly predicted divided by the number of datasets. RNN+1NN has the best average storage require- instances. We then take the average result of 10 folds to be ments of 32.3585%. The second best is CNN+1NN that average cross-validation error. For each dataset we measure gives average storage requirements of 36.0894%. The third the storage percentage as the number of instances in the re- best is IPRed which gives average storage requirements of duced training set divided by the number of instances in the 60.1881%. 1NN uses the original training set. Thus, 1NN training set and take the average result of 10 trials in the gives average storage requirements of 100%. cross-validation to be average storage percentage. In Table 2 To verify that differences in classification accuracy on we tabulate average cross-validation error and average stor- the datasets are statistically significant we use Wilcoxon age percentage % results on each dataset. rank test(Japkowicz and Shah 2011; Kanji 2006) which is a Over the 30 datasets IPRed+1NN yields the minimum av- standard test to measure the statistical significance between erage error of 24.701% and has the minimum error in 14 two methods in many datasets. It shows that one method is out of 30 datasets. The second best is 1NN that gives an considered statistically significant than the other method if average error of 25.662% and has the minimum error in 6 it outperforms the other method in many datasets. The p- out of 30 datasets. The third best is CNN+1NN that gives values in Table 3 show that our method IPRed+1NN outper- forms the other methods on the 30 datasets with statistical Hart, P. 1968. The condensed nearest neighbor rule significance. (corresp.). Information Theory, IEEE Transactions on 14(3):515–516. Discussion Japkowicz, N., and Shah, M. 2011. Evaluating Learning IPRed+1NN divides the training set into 9 partitions each Algorithms. Cambridge University Press. of the same size. For each training partition obtained from Jesús S. Aguilar-Ruiz. Dataset repository. available at the training set we store the sum of the minimum distances www.upo.es/eps/aguilar/datasets.html. between the training partition and the validation partition. Kanji, G. K. 2006. 100 statistical tests. Sage. We then construct the reduced dataset from the selected training partitions that are less than 75th percentile of all the Olvera-López, J. A.; Carrasco-Ochoa, J. A.; Martı́nez- training partitions. Finally, we measure the performance of Trinidad, J. F.; and Kittler, J. 2010. A review of instance 1NN on the validation set using the given reduced training selection methods. Artificial Intelligence Review 34(2):133– set. This approach outperforms CNN+1NN and RNN+1NN 143. in terms of accuracy(results shown in Table 2 and Table 3). Ripley, B.; Venables, W.; and Ripley, M. B. 2013. Package In this study we chose 1NN as a classification algorithm class. due to its popularity and its efficiency for CNN and RNN Wilson, D. R., and Martinez, T. R. 2000. Reduction tech- algorithms. Other classifiers such as support vector ma- niques for instance-based learning algorithms. Machine chine(SVM) can be used which may outperform 1NN. How- learning 38(3):257–286. ever, optimizing SVM by performing cross-validation to se- Wu, X.; Kumar, V.; Quinlan, J. R.; Ghosh, J.; Yang, Q.; Mo- lect the best parameters increases the runtime. IPRed+1NN toda, H.; McLachlan, G. J.; Ng, A.; Liu, B.; Philip, S. Y.; gives higher average storage percentage than CNN+1NN et al. 2008. Top 10 algorithms in data mining. Knowledge and RNN+1NN on the datasets except for steel plates faults and Information Systems 14(1):1–37. and contraceptive datasets(results shown in Table 3). Thus, It is the slowest algorithm in the conducted experiments but still computationally tractable for the large real data- sets. However, It gives better classification accuracy than the other methods most of the time. We used the standard pack- age for CNN+1NN and RNN+1NN in R(Ripley, Venables, and Ripley 2013). Conclusion We present instance reduction algorithm based on the per- centile of the partitions that outputs dataset of reduced size used by 1-nearest neighbor algorithm for classification. Our algorithm leads to better classification performance than the other popular methods by obtaining the minimum average error with statistical significance on many real datasets. References A. Asuncion, D. N. 2007. UCI machine learning repository. Alcalá, J.; Fernández, A.; Luengo, J.; Derrac, J.; Garcı́a, S.; Sánchez, L.; and Herrera, F. 2010. Keel data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple- Valued Logic and Soft Computing 17:255–287. Alpaydin, E. 1997. Voting over multiple condensed nearest neighbors. Artificial Intelligence Review 11(1-5):115–132. Bhatia, N., et al. 2010. Survey of nearest neighbor tech- niques. arXiv preprint arXiv:1007.0085. Chou, C.-H.; Kuo, B.-H.; and Chang, F. 2006. The gener- alized condensed nearest neighbor rule as a data reduction method. In Pattern Recognition, 2006. ICPR 2006. 18th In- ternational Conference on, volume 2, 556–559. IEEE. Gates, G. 1972. The reduced nearest neighbor rule (corresp.). Information Theory, IEEE Transactions on 18(3):431–433.