CKD-Tree: An Improved KD-Tree Construction Algorithm

Y Narasimhulu (a), Ashok Suthar (a), Raghunadh Pasunuri (b) and V China Venkaiah (a)

(a) SCIS, University of Hyderabad, Prof. CR Rao Road, Gachibowli, Hyderabad, 500046, India
(b) CSE, Malla Reddy Engineering College (Autonomous), Maisammaguda(H), Gundlapochampally Village, Medchal Mandal, Medchal-Malkajgiri District, Telangana State, 500100, India


Abstract
Data structures such as the VP-Tree, R-Tree and KD-Tree build an index of all the available data in the offline phase and use that indexed tree to search for and answer nearest neighbor queries, or to classify an input query. We use a Lightweight Coreset algorithm to reduce the actual data size used to build the tree index, resulting in a faster index building time. We improve on already available nearest neighbor based classification techniques and pit our classification method against the widely accepted, state-of-the-art data structures VP-Tree, R-Tree and KD-Tree. In terms of speed, the proposed method outperforms the compared data structures as the size of the data increases.

Keywords
KD Tree, Coresets, Nearest Neighbor, Classification.


ISIC 2021: International Semantic Intelligence Conference, February 25–27, 2021, New Delhi, India
Email: narasimedu@gmail.com (Y. Narasimhulu); ashok.suthar.sce@gmail.com (A. Suthar); raghupasunuri@gmail.com (R. Pasunuri); venkaiah@hotmail.com (V.C. Venkaiah)
Web: https://github.com/Narasim (Y. Narasimhulu); https://www.linkedin.com/in/ashok-suthar (A. Suthar); http://cse.mrec.ac.in/StaffDetails?FacultyId=3072 (R. Pasunuri); https://scis.uohyd.ac.in/People/profile/vch_profile.php (V.C. Venkaiah)
ORCID: 0000-0001-5440-0200 (V.C. Venkaiah)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


1. Introduction

The k-Nearest Neighbor (kNN) problem refers to the problem of finding the k points or samples in the data which are closest to a query point. The Nearest Neighbor algorithm finds use in several machine learning areas, such as classification and regression, and is also the most time-consuming part of these applications. In use cases such as recommendation systems, computer vision and robotics, fast response times are critical, and brute-force approaches such as linear search are not feasible. Hence there are several approaches to solving these Nearest Neighbor problems, based on hashing, graphs or space-partitioning trees. Space-partitioning methods are generally more efficient because they have fewer tunable parameters.

One such algorithm is the KD-Tree. It is a space-partitioning algorithm which divides space recursively using a hyper-plane chosen by a splitting rule, reducing the search space by almost half at every iteration. Another space-partitioning algorithm is the Vantage Point Tree (VP-Tree) [1], which divides the data in a metric space by selecting a position in the space called the vantage point and partitioning the data into two parts: the first part contains the data that are closer to the vantage point, and the other part the data that are not. The division process continues until the resulting sets are small. Finally, a tree is constructed such that neighbors in the tree are also neighbors in the real space. The R-Tree [2] is another data structure, most commonly used to store spatial objects such as locations of gas stations and restaurants, outlines of agricultural land, etc.

In this paper we consider kNN for classification, where the nearest neighbors of a query point in the dataset are used to classify the query point. Nearest neighbor is in essence a lazy learning algorithm, i.e. it memorizes the whole training dataset to provide the nearest neighbors of an incoming query point. Consequently, though these algorithms provide very efficient solutions to the nearest neighbor problem, they might run into trouble, because the amount of data available to process today is very large. In critical systems where time is of the essence, losing even a few seconds while processing all that data might cause issues. The authors in [3] use an SVM to tackle a similar problem by reducing the size of the data on which the Nearest Neighbor algorithm runs. We use coresets for a similar effect, but on very large datasets.

The concept of coresets follows a data summarization approach. Coresets are small subsets of the original data. They are used to scale clustering problems in massive data sets. Models trained on coresets provide competitive results against a model trained on the full original dataset. Hence they can be very useful in speeding up said models while still keeping theoretical guarantees up to a level.


Coresets are often used in clustering algorithms to improve their speed even further. To achieve this, one first constructs a coreset, usually in linear time, and then uses an algorithm that works on the coreset to solve the clustering problem. As the coreset size is very small compared to the actual data size, this can provide a significant speed-up in the said algorithms.

We use a state-of-the-art lightweight coreset construction algorithm to improve the running time of solving the Nearest Neighbor problem with the KD-Tree space-partitioning algorithm. We then use the result of the Nearest Neighbor query to classify the input query point based on the nearest neighbor points found.


2. Related Work

As noted above, there are a number of approaches to solving the nearest neighbor problem, based on hashing, graphs, space partitioning, etc. We focus mainly on the KD-Tree, a widely accepted, widely used and fast space-partitioning technique for nearest neighbor and classification problems, and compare it against the VP-Tree and R-Tree.

2.1. KD-Tree (K Dimensional Tree)

The KD-Tree is a space-partitioning algorithm. Its workflow can be divided into two parts, called the offline phase and the online phase. It builds the index (tree) in the first, offline phase, so that it can answer queries later in the online phase. During the offline phase it subdivides space into rectangular cells through the recursive use of some splitting rule. In the standard split, the splitting dimension is chosen based on the maximum spread along a dimension of the data, whereas in the midpoint split the splitting hyper-plane bisects the longest side of the cell and passes through the cell's center. The most widely used KD-Tree implementation today is based on the paper by Songrit Maneewongvatana and David M. Mount [4].

They consider a variant of the midpoint splitting rule called sliding-midpoint to overcome the problem of a data set in which points might be clustered together. An example of the sliding-midpoint rule is presented in Figure 1, and Figure 2 depicts the working of the other splitting rules.

Figure 1: Example of the sliding-midpoint rule.

Figure 2: Examples of splitting rules on a common data set.

In this approach the data is first split at the midpoint, by considering a hyper-plane which passes through the center of the cell and divides the cell's longest side into two parts. Then they check whether data points exist on both sides of the splitting plane. If so, the splitting plane is not moved. However, if one side of the splitting plane is empty, i.e. without any point, they slide the splitting plane towards the data points until it encounters at least one of them. That point then becomes a leaf cell containing a single point, while the algorithm continues to work recursively on the remaining points.

Once the tree is built and a query comes in, we trace the query point down the tree by comparing its coordinate in the cut dimension with the cut value at each level of the tree. Continuing this process leads to a leaf node; the leaf node that the query point reaches in the end holds its nearest neighbor candidates, and a query point can have more than one nearest neighbor.
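
To make the offline and online phases concrete, the following is a minimal sketch using scipy.spatial.KDTree [9], whose construction follows Maneewongvatana and Mount [4]; the data and parameters here are synthetic and purely illustrative.

import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((10000, 5))     # offline phase: index 10,000 points in 5 dimensions
tree = KDTree(points, leafsize=10)  # recursion stops once a cell holds at most 10 points

query = rng.random(5)               # online phase: a single query point
dist, idx = tree.query(query, k=3)  # distances to and indices of the 3 nearest neighbors
print(dist, points[idx])

Here leafsize controls the size of the leaf cells discussed above: smaller leaves give a deeper tree and fewer distance computations per query, at the cost of a longer build.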
2.2. kNN Classification Based on KD-Tree

The authors of [5] discuss how the KD-tree is a special storage structure for data and how it represents training data efficiently. They propose a kNN-KD-tree classification algorithm that combines the advantages offered by the kNN and KD-tree algorithms. They apply the proposed method to eleven datasets and show that their kNN-KD-tree algorithm reduces time complexity while significantly improving search performance. Another nearest neighbor search algorithm, based on sparse random projections and voting, is discussed in [6].


2.3. SVM Based Reduced NN Classification

The authors in [3] propose a novel SVM-based instance selection method. Here a Support Vector Machine (SVM) is used to form an instance space for instances of a particular pattern classification problem. A wrapper-based classification performance validation technique is used to find the best hyperparameters, which identify the support vectors. They then identify and select informative support vectors (instances) lying on the margin hyper-plane in the instance space, thus deriving a reduced training set for reduced nearest neighbor classification. This reduced training set is then used to classify new instances.

They demonstrate the performance of the proposed instance selection method on some datasets. Although their method maintained or even increased classification accuracy with a small number of training instances, all the chosen datasets are very small, the largest being the Breastw dataset with 699 instances. We focus more on datasets with a very large number of instances.
2.4. Lightweight Coresets

Coresets are compact representations of original data sets. They provide provably competitive results with models trained on the full data set. As such, coresets have been successfully used to scale up different clustering models to very large data sets.

There are several existing coreset building methodologies [7]. O. Bachem, M. Lucic and A. Krause [8] proposed a notion of lightweight coresets that allows for both multiplicative and additive errors. They provided an algorithm to construct lightweight coresets for k-means clustering and for soft and hard Bregman clustering. Their algorithm is substantially faster than the other existing coreset construction algorithms. It is parallel in the sense that the data can be divided between several threads for parallel computation. Also, as the name suggests, it results in smaller coresets.

The lightweight coreset algorithm uses variance to create a probability distribution over the points in the data set. Points farther from the mean have higher probability than points closer to the mean. A small subset of points can then be sampled based on the derived probability distribution. As points are sampled based on the variance in the data, this helps preserve the classification or clustering properties of datasets. The authors showed that their approach naturally generalizes to statistical k-means clustering. They also performed extensive experiments, demonstrating that the proposed algorithm outperformed other existing practices.

3. Construction of Coreset KD-Tree (CKD-Tree)

Though KD-Tree classification is a fairly fast algorithm in itself, it may not be so for very large datasets. To improve on the already fast KD-Tree classification algorithm, and to create an even faster version of the KD-Tree, we use a similar approach as in the case of clustering algorithms, i.e. we make use of coresets. We first use a coreset algorithm to create a representative set of points from the original data set. This representative set is then fed to the KD-Tree algorithm to build a tree index (offline phase). When a query point arrives, we feed it into the tree, where it traces down to one of the leaf nodes in the tree index. At this point any suitable search method can be used to find the nearest neighbors of the query point in the leaf node.

Algorithm 1 Lightweight Coreset Construction
Require: set of data points X, coreset size m
 1: μ ← mean of X
 2: for x ∈ X do
 3:   q(x) ← 1/2 · 1/|X| + 1/2 · d(x, μ)² / Σ_{x′∈X} d(x′, μ)²
 4: end for
 5: C ← sample m points from X, where each point x is sampled with probability q(x) and has weight 1/(m·q(x))
 6: return lightweight coreset C

We use Algorithm 1, Lightweight Coreset Construction (LWCS) [8], to create the set of representative points from the actual dataset. The algorithm takes as input a dataset X and the coreset size m, i.e. the number of representative points in the coreset. It creates a probability distribution based on each point's distance from the mean, relative to the total of all such distances; the distance metric used here is the Euclidean distance. Once every point has been assigned a probability, we sample m points, each sampled with probability q(x) and carrying weight 1/(m·q(x)).
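
For concreteness, the distribution q(x) in Algorithm 1 takes only a few vectorized operations to compute; the following is a minimal NumPy sketch of LWCS (the function name and interface are ours, not part of [8]).

import numpy as np

def lightweight_coreset(X, m, rng=None):
    # Sketch of Algorithm 1 (LWCS): sample an m-point weighted coreset of X.
    rng = np.random.default_rng() if rng is None else rng
    mu = X.mean(axis=0)                    # line 1: mean of X
    sq_dist = ((X - mu) ** 2).sum(axis=1)  # squared Euclidean distances d(x, mu)^2
    # line 3: q(x) = 1/2 * 1/|X| + 1/2 * d(x, mu)^2 / sum_{x' in X} d(x', mu)^2
    q = 0.5 / len(X) + 0.5 * sq_dist / sq_dist.sum()
    # line 5: sample m points with probability q(x), each carrying weight 1/(m q(x))
    idx = rng.choice(len(X), size=m, replace=True, p=q)
    return X[idx], 1.0 / (m * q[idx])

Note that the uniform term 1/(2|X|) keeps every point's sampling probability strictly positive, so the importance weights 1/(m·q(x)) remain bounded.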
Algorithm 2, the CKD-Tree algorithm for classification, uses Algorithm 1 (LWCS) to obtain a compact version of the original large dataset. This coreset, repData, is then used to build the tree index at line 2 of the algorithm. To build the tree index we use the sliding-midpoint [4] technique. The tree index can then be used to answer queries: a query requires you to specify k, i.e. the number of nearest neighbors required, along with the point to query with, i.e. queryPoint. The flow diagram of the entire work is presented in Figure 3.

In our specific use case, we use the nearest neighbors to classify the query point into a class. This can be done easily based on the majority class among the nearest neighbors returned.




Figure 3: Flow diagram of CKD-Tree Algorithm for Classification


Table 1
Datasets Used

    Dataset                          Number of Instances   Dimensions/Attributes
    bio_train                        145,751               74
    MiniBooNE Particle               130,065               50
    default of credit card clients   30,000                24
    HTRU2                            17,898                9
    spambase                         4,601                 57


Algorithm 2 CKD-Tree Algorithm for Classification
Require: large dataset X, coreset size m
 1: repData ← lightweightCoresetAlgorithm(large dataset X, coreset size m)
 2: tree ← KDTree(repData)
 3: dist, NNIndices ← tree.query(queryPoint, k = numOfNeighbors)
 4: for index ∈ NNIndices do
 5:   print the point at index in repData, i.e. a nearest point
 6: end for
 7: queryPointClass ← majority class of the nearest neighbor points
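
Putting the pieces together, a minimal Python sketch of the whole pipeline might look as follows; it uses scipy's cKDTree for the index and additionally carries the class labels y alongside X, a bookkeeping detail that Algorithm 2 leaves implicit. The function and its interface are illustrative, not the exact experiment code.

import numpy as np
from collections import Counter
from scipy.spatial import cKDTree

def ckd_tree_classify(X, y, query_point, m, k, rng=None):
    # Sketch of Algorithm 2: LWCS coreset -> KD-Tree index -> kNN majority vote.
    rng = np.random.default_rng() if rng is None else rng
    # line 1: repData <- LWCS(X, m), sampling indices so the labels follow the points
    mu = X.mean(axis=0)
    sq_dist = ((X - mu) ** 2).sum(axis=1)
    q = 0.5 / len(X) + 0.5 * sq_dist / sq_dist.sum()
    idx = rng.choice(len(X), size=m, replace=True, p=q)
    rep_data, rep_labels = X[idx], y[idx]
    # line 2: build the tree index; leafsize=k mirrors the leafSize = k setting
    # used in our experiments, and cKDTree splits with the sliding-midpoint rule
    tree = cKDTree(rep_data, leafsize=k)
    # line 3: k nearest neighbors of the query point among the coreset points
    dist, nn_indices = tree.query(query_point, k=k)
    # line 7: majority class of the nearest neighbor points
    return Counter(rep_labels[np.atleast_1d(nn_indices)]).most_common(1)[0][0]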
4. Experiments and Results

We implement the CKD-Tree using the above methodology and compare it against the KD-Tree [4][9] to see the performance difference it can provide, and at what cost.

All of the datasets [10] in Table 1 have two target classes. While the datasets bio_train and MiniBooNE Particle are both very large, HTRU2 and spambase are relatively small. This helps in showing the relative performance of the CKD-Tree algorithm on different types of datasets. The dataset default of credit card clients is a more balanced dataset in terms of sample size and dimensionality.

We kept 1000 samples from each dataset as a test set for testing the models. These samples are used to check the accuracy of the predictions made by the algorithm. While testing the VP-Tree and R-Tree, we considered test sample sizes of 10, 50, 100, 200, and 500, and later calculated the average times for them. While building the tree index for the KD-Tree, leafSize was kept the same as the number of nearest neighbors queried (k), i.e., leafSize = k. Here leafSize is the number of points in each leaf node of the tree index.

We measure the performance based on three factors: accuracy of the results, average time (in seconds) taken to build the tree index, and average time (in seconds) taken to answer the query. Each of these factors is compared and tabulated separately for all the data structures that were used and for the proposed work.
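
The indexing and querying times were gathered along the following lines; this helper is a sketch of the measurement approach, not the exact benchmarking script.

import time
from scipy.spatial import cKDTree

def timed_index_and_query(data, test_queries, k):
    # Measure the index build time and the average per-query answer time.
    t0 = time.perf_counter()
    tree = cKDTree(data, leafsize=k)   # offline phase: build the index
    indexing_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    for q in test_queries:             # online phase: answer each test query
        tree.query(q, k=k)
    querying_time = (time.perf_counter() - t0) / len(test_queries)
    return indexing_time, querying_time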
Table 2 shows the results of CKD-Tree for k = 10. We use three different coreset sizes, m = 1000, m = 2000 and m = 5000, and find the average of all of them (Avg. 1). For the spambase dataset the coreset of size m = 5000 is not generated, as the data size itself is only 4601. Consider the bio_train dataset: the average value of 'Indexing time' in the table is calculated by adding the indexing times of m = 1000, m = 2000 and m = 5000 and dividing by 3. The same is applied for the Querying time and Accuracy of Avg. 1.


Table 2
Proposed work comparison among various coreset sizes for k = 10

                                m = 1000                                          m = 2000
   k = 10        Indexing time   Querying time   Accuracy      Indexing time   Querying time   Accuracy
   spambase      0.131846189     0.002161955     75.1          0.14062953      0.002655576     75
   bio_train     9.025853157     0.003039943     97.6          9.060353756     0.004280325     97.8
   HTRU2         0.393140554     0.002060544     98.9          0.328070402     0.001718247     98.7
   credit card   0.738107204     0.002901038     73.8          0.593619585     0.002796253     74.2
   MiniBooNE     4.820586681     0.001834114     94.8          4.866789103     0.001765214     94.8

                                m = 5000                                Avg. 1 (of m = 1000, 2000, 5000)
                 Indexing time   Querying time   Accuracy      Indexing time   Querying time   Accuracy
   spambase      N/A             N/A             N/A           0.13623786      0.002408766     75.05
   bio_train     9.396525383     0.006560953     98.4          9.160910765     0.004627074     97.93333333
   HTRU2         0.312451839     0.002077646     98.7          0.344554265     0.001952145     98.76666667
   credit card   0.7029984       0.003421038     75.1          0.67824173      0.003039443     74.36666667
   MiniBooNE     4.655207396     0.002046397     94.8          4.78086106      0.001881908     94.8


Table 3
Proposed work comparison among various coreset sizes for k = 50

                                m = 1000                                          m = 2000
   k = 50        Indexing time   Querying time   Accuracy      Indexing time   Querying time   Accuracy
   spambase      0.159596443     0.009064549     73.6          0.156191826     0.010310147     71.4
   bio_train     9.68689847      0.018128316     97.6          9.013507843     0.013137584     97.6
   HTRU2         0.346904278     0.006985356     98.7          0.312402248     0.007482661     98.6
   credit card   0.879361391     0.006123379     75.4          0.609235764     0.007201368     75.2
   MiniBooNE     4.948148489     0.011002789     94.8          4.764508009     0.008747934     94.8

                                m = 5000                                Avg. 2 (of m = 1000, 2000, 5000)
                 Indexing time   Querying time   Accuracy      Indexing time   Querying time   Accuracy
   spambase      N/A             N/A             N/A           0.157894135     0.009687348     72.5
   bio_train     9.372775555     0.015574522     97.6          9.357727289     0.015613474     97.6
   HTRU2         0.328086615     0.007748176     98.8          0.329131047     0.007405398     98.7
   credit card   0.656133652     0.008404238     75.2          0.714910269     0.007242995     75.26666667
   MiniBooNE     4.717645407     0.00815424      94.8          4.810100635     0.009301654     94.8


Table 3 shows the results of CKD-Tree for k = 50. Again we use the three coreset sizes m = 1000, m = 2000 and m = 5000 and find the average of all of them (Avg. 2). For the spambase dataset the coreset of size m = 5000 is not generated, as the data size itself is only 4601. As with Table 2, the average value of 'Indexing time' for a dataset is calculated by adding the indexing times of m = 1000, m = 2000 and m = 5000 and dividing by 3, and the same is applied for the Querying time and Accuracy of Avg. 2.

Table 4 presents the average of Avg. 1 and Avg. 2, called the 'Overall Avg'. The Indexing time of the 'Overall Avg' is obtained by averaging the 'Indexing time' of Avg. 1 and Avg. 2; the same is applied for 'Querying time' and 'Accuracy'.

Table 5, given below, is the final comparison table. It presents the comparison among the R-Tree, VP-Tree, KD-Tree and the proposed work.

It is observed from Table 5 that the proposed work outperforms all the other data structures in querying time. Considering the indexing time, the proposed work also performs better than the VP-Tree and R-Tree. The accuracy of the proposed work is approximately the same as that of the other data structures.

Figures 4 and 5, given below, show the comparison of indexing time and querying time, respectively, among the R-Tree, VP-Tree, KD-Tree and the proposed work.

Among the data structures used for comparison, the KD-Tree is considered the best, so we concentrate mostly on the KD-Tree. In Table 6 we present the indexing time comparison between the KD-Tree and the proposed work. As the size of the data increases, the relative performance of the proposed work improves, and at some point it even starts performing better than the KD-Tree. So, for large datasets the proposed work takes less time to create the index.


Table 4
Overall Average of proposed work

                   Overall Avg. ((Avg. 1 + Avg. 2)/2)
   Dataset       Indexing time   Querying time   Accuracy
   spambase      0.147065997     0.006048057     73.775
   bio_train     9.259319027     0.010120274     97.76666667
   HTRU2         0.336842656     0.004678772     98.73333333
   credit card   0.696575999     0.005141219     74.81666667
   MiniBooNE     4.795480847     0.005591781     94.8


Table 5
Comparison among R-Tree, VP-Tree, KD-Tree and Proposed Work

                                R-Tree                                            VP-Tree
                 Indexing time   Querying time   Accuracy      Indexing time   Querying time   Accuracy
   spambase      46.01833411     0.025816591     66.632        0.9678272       0.00872661      88.216
   bio_train     300.3921245     0.674326682     98.6          46.720815       0.027852184     98.6
   HTRU2         98.00138087     0.002142198     96.6          4.2660978       0.011163483     93
   credit card   151.047457      0.08385841      79.8          6.7166709       0.048709915     79.4
   MiniBooNE     250.0596289     0.099793004     99.8          41.5471755      0.021343294     99.8

                                KD-Tree                                        Proposed Work
                 Indexing time   Querying time   Accuracy      Indexing time   Querying time   Accuracy
   spambase      0.097132921     0.009621621     71.2          0.147065997     0.006048057     73.775
   bio_train     13.37806582     0.079558667     99.3          9.259319027     0.010120274     97.76666667
   HTRU2         0.088246346     0.006940414     98.6          0.336842656     0.004678772     98.73333333
   credit card   0.799064255     0.012309615     75            0.696575999     0.005141219     74.81666667
   MiniBooNE     7.842401028     0.022687735     94.8          4.795480847     0.005591781     94.8


The breakover point, at 30,000 instances (the credit card dataset), is shown in Figure 6.

5. Conclusion

The above tables show that, for at least one value of m, each dataset showed competitive or in some cases better accuracy (default of credit card clients and HTRU2) when used with coresets. In the case of larger datasets such as bio_train and MiniBooNE, the coreset size is very small compared to the original dataset size, yet the coresets still provide almost the same accuracy as the original dataset. Also, KD-Trees built on the coresets of these datasets see a significant speed boost in the offline (indexing) and online (query) phases. We also notice that as the dataset size decreases, the gap in indexing and query speeds becomes smaller and smaller. For smaller datasets (spambase and HTRU2), this might even lead to higher query or indexing times.

This shows that CKD-Tree is a very efficient algorithm for nearest neighbor based classification on large datasets.

6. Future Work

CKD-Tree gives good results on the experimented datasets. The probability distribution used here is based on the variance of the data. Consequently, this approach might not perform well on noisy or locality-sensitive data. In such cases a better probability distribution for data summarization could be based on the locality of the data; Locality Sensitive Hashing (LSH) functions could provide a more accurate form of data summarization. For image data, a vector quantization based approach might help in data summarization.

In the online phase, the implementation of a more robust and faster search technique could also be fruitful. Randomizing the algorithm or having multiple tree indexes to increase the accuracy is another option.




Figure 4: Indexing Time Comparison




Figure 5: Querying Time Comparison


Table 6
Indexing time comparison between KD-Tree and Proposed Work

   Dataset Name   Dataset Size   KD-Tree Indexing time   Proposed Work Indexing time
   spambase       4,601          0.097132921             0.147065997
   HTRU2          17,898         0.088246346             0.336842656
   credit card    30,000         0.799064255             0.696575999
   MiniBooNE      130,065        7.842401028             4.795480847
   bio_train      145,751        13.37806582             9.259319027




Figure 6: KD-Tree and Proposed Work Comparison



References

[1] P. N. Yianilos, Data structures and algorithms for nearest neighbor search in general metric spaces, Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (1993). doi:10.1145/313559.313789.
[2] A. Guttman, R-trees: A dynamic index structure for spatial searching, Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, SIGMOD '84 (1984). doi:10.1145/602264.602266.
[3] C.-C. Huang, H.-Y. Chang, A novel SVM-based reduced NN classification method, 11th International Conference on Computational Intelligence and Security (2015). doi:10.1109/CIS.2015.23.
[4] S. Maneewongvatana, D. M. Mount, It's okay to be skinny, if your friends are fat, 4th Annual CGC Workshop on Computational Geometry (1999).
[5] W. Hou, D. Li, C. Xu, H. Zhang, T. Li, An advanced k nearest neighbor classification algorithm based on KD-tree, IEEE International Conference of Safety Produce Informatization (IICSPI) (2018). doi:10.1109/IICSPI.2018.8690508.
[6] V. Hyvönen, T. Pitkänen, S. Tasoulis, E. Jääsaari, R. Tuomainen, L. Wang, J. Corander, T. Roos, Fast nearest neighbor search through sparse random projections and voting, IEEE International Conference on Big Data (2016). doi:10.1109/BigData.2016.7840682.
[7] M. Lucic, O. Bachem, A. Krause, Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures, arXiv (2015).
[8] O. Bachem, M. Lucic, A. Krause, Scalable k-means clustering via lightweight coresets, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2018). doi:10.1145/3219819.3219973.
[9] scipy.org, Spatial KDTree class, 2020. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html.
[10] ics.uci.edu, Datasets used, 2007. URL: https://archive.ics.uci.edu/ml/datasets.php, osmot.cs.cornell.edu./kddcup/datasets.html.