  IPRed: Instance Reduction Algorithm Based on the Percentile of the Partitions

                         Turki Turki 1,2        Zhi Wei 2

   1 Computer Science Department, King Abdulaziz University,
     P.O. Box 80221, Jeddah 21589, Saudi Arabia
     tturki@kau.edu.sa
   2 Department of Computer Science, New Jersey Institute of Technology,
     University Heights, Newark, NJ 07102
     {ttt2,zhiwei}@njit.edu



                              Abstract

     Instance reduction methods are popular methods that reduce the size of the
     datasets to possibly improve the classification accuracy. We present a
     method that reduces the size of the dataset based on the percentile of the
     dataset partitions, which we call IPRed. We evaluate our and other popular
     instance reduction methods from a classification perspective by the
     1-nearest neighbor algorithm on many real datasets. Our experimental
     evaluation on the datasets shows that our method yields the minimum
     average error with statistical significance.

     Keywords: instance reduction, nearest neighbor, classification,
     statistical significance.

                            Introduction

Instance reduction methods are popular methods used in machine learning and data
mining that reduce the size of the datasets to possibly improve the
classification accuracy of the classification algorithms (Wilson and Martinez
2000). Among many methods the condensed nearest neighbor (CNN) (Hart 1968) is a
popular algorithm. CNN outputs a reduced training set Z from the original
training set S where Z classifies all instances in S correctly using 1-nearest
neighbor (1NN) (Alpaydin 1997). For a given validation instance v = (x', y'),
the prediction is given by (Wu et al. 2008)

        y' = arg max_{l' ∈ L} I(l' = y_1)                                  (1)

where l' is a label, the set of all labels is denoted by L, y_1 is the label of
the closest instance x obtained by selecting the label of the instance x ∈ Z
that gives the minimum distance from x', I(.) is an indicator function that
outputs 1 if its argument is true and 0 otherwise, and y' ∈ L is the predicted
label for the validation instance for which I(y' = y_1) attains its maximum.
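For concreteness, the prediction of Equation 1 over a reduced set Z can be
sketched as below. This is an illustrative Python/NumPy sketch rather than the
authors' implementation (their experiments were written in R), and the array
names and the choice of Euclidean distance are assumptions of the example.

    import numpy as np

    def predict_1nn(Z_X, Z_y, x_prime):
        """Sketch of Equation 1: return the label y_1 of the instance in Z that
        is closest to the validation instance x'. Z_X is an (m, d) array of
        stored instances and Z_y the corresponding length-m label array."""
        dists = np.sqrt(((Z_X - x_prime) ** 2).sum(axis=1))  # distance from x' to each z in Z
        return Z_y[np.argmin(dists)]                          # label of the nearest instance

    # Toy usage: the validation point is closest to the instance labeled 1
    Z_X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
    Z_y = np.array([0, 0, 1])
    print(predict_1nn(Z_X, Z_y, np.array([4.5, 4.8])))        # prints 1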
Reduced nearest neighbor (RNN) (Gates 1972) is a modification to improve CNN
which includes an additional step that removes instances from the reduced
training set Z that do not cause any misclassification on the training set S
(Bhatia and others 2010). The generalized condensed nearest neighbor rule (GCNN)
is another variant of CNN which iteratively constructs the reduced training set
Z from the training set S by selecting instances from S according to a criterion
(Chou, Kuo, and Chang 2006). All the previously mentioned algorithms
significantly reduce the storage requirement for the training set used by the
classification algorithms (Wilson and Martinez 2000; Chou, Kuo, and Chang 2006).
However, the classification accuracy degrades most of the time on the validation
set compared to the classification accuracy of the classification algorithms
using the original training set (Chou, Kuo, and Chang 2006; Wilson and Martinez
2000).

In this paper we consider an approach that partitions the dataset into training
partitions S_k for k = 1...9 and a validation partition V, and computes the sum
of minimum distances between the validation partition V and each training
partition S_k. We then construct the reduced training set Z by selecting the
training partitions that are less than the pth percentile of all the training
partitions for a specific p value. The goal is to select the training partitions
that contain instances similar to the validation instances and remove those with
dissimilar instances. The idea behind the strategy is that instances dissimilar
to the validation instances will have little, if not adverse, effect on
predicting the validation instances, and thus can be removed with little, if not
positive, effect on classification. A real-life analogy is a professor giving
students a take-home exam (i.e. a validation set without labels): students will
look for the chapters (i.e. training partitions) of the book (i.e. the training
set) which contain information (i.e. instances of the training partitions)
similar to the exam questions (i.e. instances of the validation set) and skip
irrelevant chapters. For a given reduced training set Z the prediction is given
by using 1NN (Equation 1). Compared to other popular instance reduction methods,
our experimental evaluation on 30 datasets shows that our method IPRed yields
the minimum average error with statistical significance.

The rest of this paper is organized as follows. In section II we review related
work. In section III we present our method IPRed. Following that we present the
experimental evaluation and discussion before concluding.
                            Related Work

Condensed Nearest Neighbor Rule

The CNN outlined in Algorithm 1 works as follows (Alpaydin 1997). CNN receives
the training set S and an empty subset Z (lines 1-2). At each iteration of the
for loop of lines 6-13 an instance x is selected randomly from the training set
S (line 7). Lines 8-9 perform 1NN classification as follows. In line 8 the
instance z_c is obtained by selecting the instance z_j in the stored subset Z
that gives the minimum distance from x. Lines 9-10 store the instance x in the
subset Z if it is incorrectly classified. The do-while loop of lines 4-14
terminates when the instances in the training set S are correctly classified by
the stored subset Z with 1NN. Line 15 returns the reduced training set Z. For a
given Z the prediction on the validation set V is given by 1NN (Equation 1).
Algorithm 1 Condensed Nearest Neighbor algorithm (CNN)
 1: CNN(S, Z)
 2:   Z ← ∅
 3:   Pre_add ← 1
 4:   do
 5:     add ← 0
 6:     for all instances in training set S do
 7:       Randomly select x from S
 8:       Find z_c ∈ Z such that D(x, z_c) = min_j D(x, z_j)
 9:       if label(x) ≠ label(z_c) then
10:         Z ← Z ∪ x
11:         add ← 1
12:       end if
13:     end for
14:   while (Pre_add == add)
15:   return (Z)
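A compact sketch of the same procedure in Python follows; Algorithm 1 above is
the authoritative description, and the function name, the seeding of Z with a
single random instance, and the Euclidean distance are assumptions made for the
illustration. The loop keeps adding misclassified instances to Z until a full
pass over S adds nothing, mirroring the do-while loop of lines 4-14.

    import numpy as np

    def cnn_reduce(X, y, seed=0):
        """Hedged sketch of the CNN rule: repeatedly pass over S in random
        order and add to Z every instance that the current Z misclassifies
        with 1NN, stopping when a complete pass adds nothing."""
        rng = np.random.default_rng(seed)
        Z_idx = [int(rng.integers(len(X)))]   # seed Z with one instance (the pseudocode starts empty)
        added = True
        while added:                          # plays the role of the do-while loop
            added = False
            for i in rng.permutation(len(X)):             # random order over S
                d = np.sqrt(((X[Z_idx] - X[i]) ** 2).sum(axis=1))
                z_c = Z_idx[int(np.argmin(d))]            # nearest stored instance
                if y[z_c] != y[i]:                        # misclassified by current Z
                    Z_idx.append(int(i))                  # so store x in Z
                    added = True
        return X[Z_idx], y[Z_idx]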
Reduced Nearest Neighbor Rule

The reduced nearest neighbor (RNN) (Gates 1972; Wilson and Martinez 2000) is a
modification to CNN which starts with the training set S and a reduced training
set Z where all instances are copied from the training set S to the reduced
training set Z. The RNN algorithm iteratively removes each instance from Z if
the removal does not lead to any misclassification of the other instances in S
using the remaining instances in Z. Prediction is performed on the validation
set using Z.
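The removal step can be sketched as follows; this is again an illustrative
Python sketch under the same naming and distance assumptions as the earlier
snippets, not the authors' R implementation.

    import numpy as np

    def rnn_reduce(X, y):
        """Hedged sketch of the RNN rule: start with Z = S, then drop each
        instance whose removal leaves every instance of S correctly classified
        by the remaining Z with 1NN."""
        keep = np.ones(len(X), dtype=bool)               # Z is initially a copy of S
        for i in range(len(X)):
            keep[i] = False                              # tentatively remove instance i
            if not keep.any():                           # never let Z become empty
                keep[i] = True
                continue
            Z_X, Z_y = X[keep], y[keep]
            for j in range(len(X)):                      # check every instance of S
                d = np.sqrt(((Z_X - X[j]) ** 2).sum(axis=1))
                if Z_y[int(np.argmin(d))] != y[j]:       # removal caused a misclassification
                    keep[i] = True                       # so put instance i back
                    break
        return X[keep], y[keep]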
Generalized Condensed Nearest Neighbor Rule

The generalized condensed nearest neighbor (GCNN) (Chou, Kuo, and Chang 2006;
Olvera-López et al. 2010) sequentially adds instances from the training set S to
the reduced training set Z when the instances are not absorbed by Z. An instance
x ∈ S is absorbed when |x − a| − |x − b| > δ, where a ∈ Z is the nearest
instance to x with the same class label (i.e. label(x) = label(a)), b ∈ Z is the
nearest instance to x with a different class label (i.e. label(x) ≠ label(b)),
and δ is the threshold. Prediction is made on the validation set using Z.
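As a small illustration, the absorption test described above could be coded as
follows; the inequality is written exactly as stated in the text, and the helper
name and the Euclidean norm are assumptions of this Python sketch.

    import numpy as np

    def is_absorbed(x, x_label, Z_X, Z_y, delta):
        """Absorption test for GCNN as stated in the text: x is absorbed when
        |x - a| - |x - b| > delta, with a the nearest instance in Z sharing x's
        label and b the nearest instance in Z with a different label."""
        same = (Z_y == x_label)
        if not same.any() or same.all():
            return False                              # need both a same-class and a different-class instance in Z
        d = np.sqrt(((Z_X - x) ** 2).sum(axis=1))     # distances from x to every instance in Z
        dist_a = d[same].min()                        # nearest same-class instance a
        dist_b = d[~same].min()                       # nearest different-class instance b
        return (dist_a - dist_b) > delta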
                              IPRed

As shown in Algorithm 2, the DP algorithm receives as input a dataset
D ∈ R^{k×(d+1)} of size k where the label associated with the instance
x_k ∈ R^d is y_k (line 1). In line 2 we divide the dataset size k by 10 and take
the floor of the result. Lines 3-4 initialize the variables. The for loop of
lines 5-18 creates 9 training partitions P_k for k = 1...9 by taking subsamples
without replacement from the dataset D, where the size of each P_k is the same.
The for loop of lines 20-28 creates the validation partition V1 by taking the
remaining subsamples without replacement from the dataset D. Line 29 assigns the
9 training partitions to S. In line 30 we assign the validation partition V1 to
V. In line 31 our algorithm, outlined in Algorithm 3, is called by taking the
partitions in S and V as inputs. The IPRed algorithm outputs a training set Z of
reduced size to guide the 1NN algorithm for better classification performance on
the validation set V.
Algorithm 2 Dataset Partitioning algorithm (DP)
 1: DP(D = {(x_1, y_1), ..., (x_k, y_k)})
 2:   size ← ⌊k/10⌋
 3:   b ← 1
 4:   c ← size
 5:   for k = 1 to 9 do
 6:     q ← 1
 7:     for i = b to size do
 8:       r ← 1
 9:       for j = 1 to d do
10:         P_k[q, r] ← x[i, j]
11:         r ← r + 1
12:       end for
13:       P_k[q, r] ← y[i]
14:       q ← q + 1
15:     end for
16:     b ← b + c
17:     size ← size + c
18:   end for
19:   q ← 1
20:   for i = b to k do
21:     r ← 1
22:     for j = 1 to d do
23:       V1[q, r] ← x[i, j]
24:       r ← r + 1
25:     end for
26:     V1[q, r] ← y[i]
27:     q ← q + 1
28:   end for
29:   S ← {P_1, P_2, ..., P_9}
30:   V ← {V1}
31:   Z ← IPRed(S, V)
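A minimal sketch of this partitioning step is shown below. It assumes the
dataset is given as a feature matrix X and a label vector y (rather than the
combined matrix D of Algorithm 2), takes consecutive blocks exactly as the
pseudocode does, and leaves any prior shuffling of the rows to the caller.

    import numpy as np

    def dataset_partition(X, y):
        """Sketch of Algorithm 2 (DP): cut the dataset into 9 equal training
        partitions of size floor(k/10) each and use the remaining rows as the
        validation partition."""
        k = len(X)
        size = k // 10                                   # line 2: floor(k / 10)
        P = [(X[i * size:(i + 1) * size], y[i * size:(i + 1) * size])
             for i in range(9)]                          # the 9 training partitions
        V = (X[9 * size:], y[9 * size:])                 # the remaining rows form V1
        return P, V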
The IPRed algorithm, which is outlined in Algorithm 3, works as follows. Line 2
stores the size of the validation partition V1. The for loop of lines 3-12
iterates 9 times to store the sum of the minimum Euclidean distances (line 11)
between the partitions as follows. At each iteration of the for loop of lines
5-10 we store the computed Euclidean distances (line 7) between the nth instance
in V1[n, j] and each instance i in the kth partition P_k[i, j] for
i = 1...p_size and j = 1...d (for loop of lines 6-8). We then store the minimum
distance (line 9). After the for loop of lines 5-10 terminates, we store the sum
of the minimum distances between the kth partition P_k and V1 (line 11). Thus,
the for loop of lines 3-12 iterates 9 times to store the results for the 9
training partitions (line 11). In line 13 we use the percentile as a robust
measurement of the location of the data to store the 75th percentile of all
training partitions. Line 14 initializes j to 1. The for loop of lines 15-20
selects the ith training partition that is less than the 75th percentile of all
training partitions for i = 1...9. We store the index of the ith training
partition that satisfies the condition (line 16) in Partitions_index (line 17).
After the for loop of lines 22-31 terminates we return the reduced training set
Z (line 32) which contains the selected training partitions in Partitions_index.
Prediction on the validation set V is given by 1NN (Equation 1) using Z.

Algorithm 3 IPRed algorithm
 1: IPRed(S = {P_1, P_2, ..., P_9}, V = {V1})
 2:   v_size ← V1.size
 3:   for k = 1 to 9 do
 4:     p_size ← P_k.size
 5:     for n = 1 to v_size do
 6:       for i = 1 to p_size do
 7:         distances[i] ← sqrt( Σ_{j=1}^{d} (P_k[i, j] − V1[n, j])² )
 8:       end for
 9:       nearest_instances[n] ← Min(distances)
10:     end for
11:     Partitions[k] ← Σ_{t=1}^{v_size} nearest_instances[t]
12:   end for
13:   Q4 ← Percentile(Partitions, 75%)
14:   j ← 1
15:   for i = 1 to length(Partitions) do
16:     if Partitions[i] < Q4 then
17:       Partitions_index[j] ← i
18:       j ← j + 1
19:     end if
20:   end for
21:   n ← 1
22:   for m = 1 to length(Partitions_index) do
23:     k ← Partitions_index[m]
24:     p_size ← P_k.size
25:     for i = 1 to p_size do
26:       for j = 1 to d + 1 do
27:         Z[n, j] ← P_k[i, j]
28:       end for
29:       n ← n + 1
30:     end for
31:   end for
32:   return (Z)
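The selection logic of Algorithm 3 can be sketched compactly as follows:
per-partition sums of minimum Euclidean distances to the validation instances, a
75th-percentile cut, and concatenation of the surviving partitions. The helper
names and the list-of-arrays representation of the partitions are assumptions of
this Python sketch.

    import numpy as np

    def ipred_reduce(P, V, percentile=75):
        """Sketch of Algorithm 3 (IPRed). P is a list of (X, y) training
        partitions, V is the (X, y) validation partition. Returns the reduced
        training set Z."""
        V_X, _ = V
        sums = []
        for P_X, _ in P:
            # pairwise Euclidean distances between validation and partition instances
            d = np.sqrt(((V_X[:, None, :] - P_X[None, :, :]) ** 2).sum(axis=2))
            sums.append(d.min(axis=1).sum())     # sum of per-validation-instance minima
        q = np.percentile(sums, percentile)      # line 13: 75th percentile of the sums
        keep = [i for i, s in enumerate(sums) if s < q]
        Z_X = np.vstack([P[i][0] for i in keep])
        Z_y = np.concatenate([P[i][1] for i in keep])
        return Z_X, Z_y

Under these assumptions, the reduced set used with Equation 1 would be obtained
as Z_X, Z_y = ipred_reduce(*dataset_partition(X, y)), with dataset_partition
from the earlier sketch.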
                      Experimental Evaluation

We experimentally evaluate the performance of our instance reduction algorithm
and compare it against others from a classification perspective (Japkowicz and
Shah 2011) by the 1-nearest neighbor (1NN) algorithm on the 30 real datasets
shown in Table 1. This section describes the datasets and the experimental
methodology, then presents the experimental results.

   Code   Dataset                        Classes   Dimensions   Instances
   1      Soy Bean                       4         35           47
   2      Colon Cancer                   2         2000         62
   3      Leukemia                       2         7129         72
   4      Breast Tissue                  6         9            106
   5      Appendicitis                   2         7            106
   6      Iris                           3         4            150
   7      Hayes-Roth                     3         4            160
   8      Global Cancer Map              14        16063        190
   9      Glass Identification           6         10           214
   10     Heart                          2         13           270
   11     Haberman                       2         3            306
   12     Ecoli                          8         6            336
   13     Liver Disorders                2         6            345
   14     Bci                            2         117          400
   15     Saheart                        2         9            462
   16     Musk                           2         166          476
   17     Led                            10        7            500
   18     Climate                        2         18           540
   19     Breast Cancer                  2         30           569
   20     Digits                         2         63           762
   21     Diabetes                       2         8            768
   22     Vowel                          11        13           990
   23     Statlog German Credit Card     2         24           1000
   24     Banknote Authentication        2         4            1372
   25     Contraceptive                  3         9            1473
   26     Wine Quality Red               11        11           1599
   27     Ozone                          2         72           1847
   28     Steel Plates Faults            7         27           1941
   29     Insurance Company COIL 2000    2         85           5822
   30     Magic Gamma Telescope          2         10           19020

Table 1: Datasets from the UCI (A. Asuncion 2007), KEEL (Alcalá et al. 2010),
and Bioinformatics repositories (Jesús S. Aguilar-Ruiz) that we used in our
experimental evaluation.

Datasets

We use 30 real datasets in our experiments for classification. The information
on the datasets is tabulated in Table 1, which is ordered in terms of increasing
number of instances. The datasets are obtained from three different sources. The
appendicitis and saheart datasets are obtained from the KEEL dataset repository
(Alcalá et al. 2010). The colon cancer, leukemia, and global cancer map datasets
are obtained from the BioInformatics Research Group dataset repository (Jesús S.
Aguilar-Ruiz). All the remaining datasets are from the UCI machine learning
repository (A. Asuncion 2007).

Experimental Methodology

We evaluate four classification algorithms: IPRed+1NN, CNN+1NN, RNN+1NN, and
1NN, where the four algorithms are 1-nearest neighbor applied to IPRed,
1-nearest neighbor applied to condensed nearest neighbor, 1-nearest neighbor
applied to reduced nearest neighbor, and 1-nearest neighbor respectively. We use
10-fold cross-validation on each dataset, ensuring the same splits for each
algorithm. For each instance reduction method we reduce the size k of the
original training set from S ∈ R^{k×(d+1)} to Z ∈ R^{k'×(d+1)} where k' < k. We
then apply 1NN with the given Z on the validation set. Recall that 1NN is given
by y' = arg max_r I(r = y_1) (Equation 1). We wrote our code in R.
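To make the protocol concrete, one cross-validation fold could be scored as in
the sketch below (illustrative Python, reusing the predict_1nn sketch from the
introduction; the actual experiments reported here were run in R). The reducer
argument stands for any of the instance reduction methods, and the two returned
quantities correspond to the error and storage percentage columns of Table 2.

    import numpy as np

    def evaluate_fold(X_train, y_train, X_val, y_val, reducer):
        """Sketch of one cross-validation fold: reduce S to Z, classify the
        validation split with 1NN, and report error and storage percentage."""
        Z_X, Z_y = reducer(X_train, y_train)             # e.g. cnn_reduce or rnn_reduce
        preds = np.array([predict_1nn(Z_X, Z_y, x) for x in X_val])
        error = np.mean(preds != y_val) * 100            # % of instances incorrectly predicted
        storage = 100.0 * len(Z_X) / len(X_train)        # % of the training set kept in Z
        return error, storage

Averaging the two quantities over the 10 folds gives the average
cross-validation error and average storage percentage reported in Table 2.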
 Code    Dataset                         IPRed+1NN      %         CNN+1NN     %         RNN+1NN      %          1NN        %
 1       Soy Bean                        2.5            64.4681   5           15.5319   5            14.6809    2.5        100
 2       Colon Cancer                    15.417         60.6452   18.75       35.4839   18.333       26.9355    15.417     100
 3       Leukemia                        10             60        16.8254     28.8889   20.794       24.5833    15.397     100
 4       Breast Tissue                   35.625         59.434    39.375      54.717    36.75        49.6226    41.75      100
 5       Appendicitis                    18.125         61.6981   21.125      33.3019   23.5         26.6038    18.5       100
 6       Iris                            4              60        6           12.1333   8            9.8667     4          100
 7       Hayes-Roth                      29.375         60        30          48.875    30.625       44.9375    29.375     100
 8       Global Cancer Map               38.947         60        40          49.7368   41.579       44.7895    39.474     100
 9       Glass Identification            0.8            60.5607   0.8762      7.2897    0.952        4.8131     0.876      100
 10      Heart                           39.259         60        41.1111     52.8519   41.852       47.4444    42.222     100
 11      Haberman                        29.444         59.4118   33.5556     47.4837   37.444       45.1961    29.111     100
 12      Ecoli                           22.867         60.1786   25.8042     37.5      27.063       33.869     23.823     100
 13      Liver Disorders                 34.472         60        41.9306     52.9275   39.578       47.3333    36.674     100
 14      Bci                             44             60        45.75       55        42.5         49.275     45.5       100
 15      Saheart                         42.88          59.9134   43.5507     53.5498   44.629       46.9048    43.551     100
 16      Musk                            15.905         60.1261   13.9904     27.563    15.054       21.9748    15.66841   100
 17      Led                             28.2           59        29          42.34     26.8         41.64      28.6       100
 18      Climate                         11.111         60        16.6667     24.1667   18.704       19.7963    11.111     100
 19      Breast Cancer                   9.316          60.1582   9.8764      14.9736   9.698        11.9332    9.008      100
 20      Digits                          0.128          60.0787   1.046       3.9895    1.704        3.1102     0.256      100
 21      Diabetes                        32.832         59.8958   34.8872     46.0026   35.207       39.3229    32.531     100
 22      Vowel                           36.263         60        37.7778     19.0202   40.101       17.5758    36.263     100
 23      Statlog German Credit Card      34.5           60        34.6        49.38     36.7         47.39      33.4       100
 24      Banknote Authentication         0.073          60.0437   0.219       1.7274    0.219        1.5306     0.073      100
 25      Contraceptive                   51.942         59.9593   53.9116     67.3999   54.314       67.1079    51.943     100
 26      Wine Quality Red                54.552         60        56.7105     54.7154   57.697       49.7311    55.37      100
 27      Ozone                           11.323         60.0758   15.9808     22.1765   18.195       17.8073    11.162     100
 28      Steel Plates Faults             61.619         60        62.443      65.4044   62.907       61.6383    61.722     100
 29      Insurance Company COIL 2000     10.821         59.9966   13.1751     23.0831   13.243       22.6984    10.564     100
 30      Magic Gamma Telescope           14.732         60        27.6656     35.469    28.465       30.643     24.022     100
         Average                         24.701         60.1881   27.253      36.0894   27.92        32.3585    25.662     100

Table 2: Average cross-validation error and average storage percentage (%) of the algorithms on each of the 30 real datasets from
the UCI (A. Asuncion 2007), KEEL (Alcalá et al. 2010), and Bioinformatics repositories (Jesús S. Aguilar-Ruiz). For each algorithm,
the first column is the average cross-validation error (%) and the adjacent % column is its average storage percentage. The
algorithm with the minimum error is shown in bold.

                                                     CNN+1NN         RNN+1NN         1NN
                                     IPRed+1NN        0.000007        0.00001       0.02088
                                      CNN+1NN                          0.0226       0.00018
                                      RNN+1NN                                       0.00044

                     Table 3: P-values of the Wilcoxon rank test (two-tailed) between all pairs of algorithms.


Experimental Results on Thirty Datasets

For each training-validation split in our classification tasks we measure the
error on the validation split as the number of instances incorrectly predicted
divided by the number of instances. We then take the average result of the 10
folds to be the average cross-validation error. For each dataset we measure the
storage percentage as the number of instances in the reduced training set
divided by the number of instances in the training set, and take the average
result of the 10 trials in the cross-validation to be the average storage
percentage. In Table 2 we tabulate the average cross-validation error and
average storage percentage results on each dataset.

Over the 30 datasets IPRed+1NN yields the minimum average error of 24.701% and
has the minimum error in 14 out of 30 datasets. The second best is 1NN, which
gives an average error of 25.662% and has the minimum error in 6 out of 30
datasets. The third best is CNN+1NN, which gives a higher average error of
27.253% and has the minimum error in 1 out of 30 datasets. RNN+1NN gives an
average error of 27.92% and has the minimum error in 2 out of 30 datasets.
RNN+1NN has the best average storage requirement of 32.3585%. The second best is
CNN+1NN, which gives an average storage requirement of 36.0894%. The third best
is IPRed, which gives an average storage requirement of 60.1881%. 1NN uses the
original training set; thus, 1NN gives a storage requirement of 100%.

To verify that the differences in classification accuracy on the datasets are
statistically significant we use the Wilcoxon rank test (Japkowicz and Shah
2011; Kanji 2006), which is a standard test to measure the statistical
significance between two methods over many datasets. One method is considered
statistically significantly better than the other if it outperforms the other
method on many datasets. The p-values in Table 3 show that our method IPRed+1NN
outperforms the other methods on the 30 datasets with statistical significance.
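For reference, this paired comparison over the 30 per-dataset errors can be
computed with SciPy as sketched below; the use of the two-sided Wilcoxon
signed-rank test is an assumption consistent with the paired, two-tailed setting
described here, since the exact variant is not spelled out further in the text.

    from scipy.stats import wilcoxon

    def compare_methods(errors_a, errors_b):
        """Two-sided Wilcoxon signed-rank test on paired per-dataset errors,
        e.g. the 30 average cross-validation errors of IPRed+1NN and CNN+1NN
        from Table 2."""
        stat, p_value = wilcoxon(errors_a, errors_b)
        return p_value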
                            Discussion

IPRed+1NN divides the training set into 9 partitions, each of the same size. For
each training partition obtained from the training set we store the sum of the
minimum distances between the training partition and the validation partition.
We then construct the reduced dataset from the selected training partitions that
are less than the 75th percentile of all the training partitions. Finally, we
measure the performance of 1NN on the validation set using the given reduced
training set. This approach outperforms CNN+1NN and RNN+1NN in terms of accuracy
(results shown in Table 2 and Table 3).

In this study we chose 1NN as the classification algorithm due to its popularity
and its efficiency for the CNN and RNN algorithms. Other classifiers such as the
support vector machine (SVM) can be used and may outperform 1NN. However,
optimizing SVM by performing cross-validation to select the best parameters
increases the runtime. IPRed+1NN gives a higher average storage percentage than
CNN+1NN and RNN+1NN on the datasets, except for the steel plates faults and
contraceptive datasets (results shown in Table 2). Thus, it is the slowest
algorithm in the conducted experiments but still computationally tractable for
the large real datasets. However, it gives better classification accuracy than
the other methods most of the time. We used the standard package for CNN+1NN and
RNN+1NN in R (Ripley, Venables, and Ripley 2013).

                            Conclusion

We present an instance reduction algorithm based on the percentile of the
partitions that outputs a dataset of reduced size used by the 1-nearest neighbor
algorithm for classification. Our algorithm leads to better classification
performance than the other popular methods by obtaining the minimum average
error with statistical significance on many real datasets.

                            References

A. Asuncion, D. N. 2007. UCI machine learning repository.
Alcalá, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; and
Herrera, F. 2010. KEEL data-mining software tool: Data set repository,
integration of algorithms and experimental analysis framework. Journal of
Multiple-Valued Logic and Soft Computing 17:255–287.
Alpaydin, E. 1997. Voting over multiple condensed nearest neighbors. Artificial
Intelligence Review 11(1-5):115–132.
Bhatia, N., et al. 2010. Survey of nearest neighbor techniques. arXiv preprint
arXiv:1007.0085.
Chou, C.-H.; Kuo, B.-H.; and Chang, F. 2006. The generalized condensed nearest
neighbor rule as a data reduction method. In Pattern Recognition, 2006. ICPR
2006. 18th International Conference on, volume 2, 556–559. IEEE.
Gates, G. 1972. The reduced nearest neighbor rule (corresp.). Information
Theory, IEEE Transactions on 18(3):431–433.
Hart, P. 1968. The condensed nearest neighbor rule (corresp.). Information
Theory, IEEE Transactions on 14(3):515–516.
Japkowicz, N., and Shah, M. 2011. Evaluating Learning Algorithms. Cambridge
University Press.
Jesús S. Aguilar-Ruiz. Dataset repository. Available at
www.upo.es/eps/aguilar/datasets.html.
Kanji, G. K. 2006. 100 Statistical Tests. Sage.
Olvera-López, J. A.; Carrasco-Ochoa, J. A.; Martínez-Trinidad, J. F.; and
Kittler, J. 2010. A review of instance selection methods. Artificial
Intelligence Review 34(2):133–143.
Ripley, B.; Venables, W.; and Ripley, M. B. 2013. Package class.
Wilson, D. R., and Martinez, T. R. 2000. Reduction techniques for
instance-based learning algorithms. Machine Learning 38(3):257–286.
Wu, X.; Kumar, V.; Quinlan, J. R.; Ghosh, J.; Yang, Q.; Motoda, H.; McLachlan,
G. J.; Ng, A.; Liu, B.; Philip, S. Y.; et al. 2008. Top 10 algorithms in data
mining. Knowledge and Information Systems 14(1):1–37.