<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CKD-TREE: AN IMPROVED KD-TREE CONSTRUCTION ALGORITHM</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Y Narasimhulu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashok Suthar</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raghunadh Pasunuri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>V China Venkaiah</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CSE, Malla Reddy Engineering College(Autonomous)</institution>
          ,
          <addr-line>Maisammaguda(H), Gundlapochampally Village, Medchal Mandal, Medchal-Malkajgiri District, Telangana State, 500100</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>SCIS, University of Hyderabad</institution>
          ,
          <addr-line>Prof. CR Rao Road, Gachibowli, Hyderabad, 500046</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>211</fpage>
      <lpage>218</lpage>
      <abstract>
<p>Data structures such as the VP-Tree, R-Tree and KD-Tree build an index of all the available data in the offline phase and use that indexed tree to answer nearest neighbor queries or to classify an input query. We use a lightweight coreset algorithm to reduce the amount of data used to build the tree index, resulting in a faster index building time. We improve on existing nearest neighbor based classification techniques and compare our classification method against the widely accepted, state-of-the-art data structures VP-Tree, R-Tree and KD-Tree. In terms of speed, the proposed method outperforms the compared data structures as the size of the data increases.</p>
      </abstract>
      <kwd-group>
<kwd>KD Tree</kwd>
        <kwd>Coresets</kwd>
        <kwd>Nearest Neighbor</kwd>
        <kwd>Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1. Introduction</title>
      <p>The k-Nearest Neighbor (kNN) problem refers to the problem of finding the k points or samples in the data which are closest to a query point. The Nearest Neighbor algorithm finds its use in several machine learning areas, such as classification and regression, and is also the most time-consuming part of these applications. In different use cases, such as recommendation systems, computer vision and robotics, fast response times are critical, and using brute force approaches such as linear search is not feasible. Hence there are several approaches to solve these Nearest Neighbor problems, based on hashing, graphs or space-partitioning trees. Space-partitioning methods are generally more efficient due to having fewer tunable parameters.</p>
      <p>One such algorithm is the KD-Tree. It is a space partitioning algorithm which divides space recursively using a hyper-plane based on a splitting rule. It reduces the search space by almost half at every iteration. Another space partitioning algorithm is the Vantage Point Tree (VP-Tree) [1], which divides the data in a metric space by selecting a position in the space called a vantage point and partitioning the data into two parts: the first part contains the data that are closer to the vantage point, and the other part the data that are not. The division process continues until only small sets remain. Finally a tree is constructed such that neighbors in the tree are also neighbors in the real space. The R-Tree [2] is another data structure that is most commonly used to store spatial objects such as locations of gas stations, restaurants, outlines of agricultural lands, etc.</p>
      <p>In this paper we consider kNN for classification, where the nearest neighbors of a query point in the dataset are used to classify the query point. Nearest neighbor is in essence a lazy learning algorithm, i.e. it memorizes the whole training dataset to provide the nearest neighbors of an incoming query point. Consequently, though such algorithms provide very efficient solutions to the nearest neighbor problem, they might run into problems, because the data to be processed becomes too large given the high magnitudes of data available today. In critical systems where time is of the essence, losing even a few seconds while processing all that data might cause issues. The author in [3] uses an SVM to tackle a similar problem by reducing the size of the data on which the Nearest Neighbor algorithm runs. We use coresets for a similar effect, but on very large datasets.</p>
<p>The concept of coresets follows a data summarization approach. Coresets are small subsets of the original data. They are used to scale clustering problems in massive data sets. Models trained on coresets provide competitive results against a model trained on the full original dataset. Hence they can be very useful in speeding up said models while still keeping theoretical guarantees up to a level. Coresets are often used in clustering algorithms to improve their speed even further. To achieve this, one first constructs a coreset, usually in linear time, and then uses an algorithm that works on the coreset to solve the clustering problem. As the coreset size is very small compared to the actual data size, this can provide a significant speed-up in the said algorithms.</p>
<p>We use a state-of-the-art lightweight coreset construction algorithm to reduce the time taken to solve the Nearest Neighbor problem with the KD-Tree space partitioning algorithm. We then use the result of the Nearest Neighbor query to classify the input query point based on the nearest neighbor points found.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
<p>As mentioned earlier, there are a number of approaches to solve a nearest neighbor problem, based on hashing, graphs, space-partitioning, etc. We focus mainly on the KD-Tree, a widely accepted, widely used and fast space partitioning technique for Nearest Neighbor or Classification problems, rather than on the VP-Tree and R-Tree.</p>
      <sec id="sec-2-1">
        <title>2.1. KD-Tree (K Dimensional-tree)</title>
        <p>The KD-Tree is a space-partitioning algorithm. Its main workflow can be divided into two parts, called the offline phase and the online phase. It builds the index (tree) in the first phase, called the offline phase, so that it can answer queries later on in the other phase, called the online phase. During the offline phase it subdivides space into rectangular cells through the recursive use of some splitting rule. In the standard split, the splitting dimension is chosen based on the maximum spread along a dimension in the data, whereas in the midpoint split the splitting hyper-plane bisects the longest side of the cell and passes through the center of the cell. The most widely accepted KD-Tree implementation in use today is based on the paper by Songrit Maneewongvatana and David M. Mount [4].</p>
        <p>They consider a variant of the midpoint splitting rule, called sliding-midpoint, to overcome the problem of having a data set in which points might be clustered together. An example of the sliding-midpoint rule is presented in Figure 1, and Figure 2 depicts examples of other splitting rules on a common data set.</p>
        <p>In this approach the data is first split at the midpoint, by considering a hyper-plane which passes through the center of the cell, dividing the cell's longest side into two parts. Then they check if data points exist on both sides of the splitting plane. If so, the splitting plane is not moved. However, if one side of the splitting plane is empty, i.e. without any point, then they slide the splitting plane towards the data points until it encounters at least one of the points. That point then becomes a child or leaf cell containing a single point, while the algorithm continues to work recursively on the remaining points.</p>
        <p>Once the tree is built completely and a query comes in, we trace the query point down along the tree, comparing its dimension value with the cut dimension value at each level of the tree. Continuing this process leads to a leaf node. The leaf node which the query point reaches in the end is called its nearest neighbor. A query point can have more than one nearest neighbor.</p>
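        <p>To make the sliding-midpoint rule above concrete, the following is a minimal NumPy sketch of a single split, reconstructed from the description given here rather than taken from [4]; the function name, return convention, and the handling of ties are our own, and degenerate cells (all points identical) are not handled.</p>
        <preformat>
import numpy as np

def sliding_midpoint_split(points):
    """One split of the sliding-midpoint rule: bisect the cell's longest
    side, then slide the plane onto the data if one side would be empty."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    axis = int(np.argmax(hi - lo))          # cut the longest side of the cell
    split = 0.5 * (lo[axis] + hi[axis])     # midpoint hyper-plane
    coords = points[:, axis]
    left = coords &lt;= split
    if left.all():                          # right side empty: slide up to the data
        split = coords.max()
        left = coords &lt; split             # the max point becomes the other child
    elif not left.any():                    # left side empty: slide down to the data
        split = coords.min()
        left = coords &lt;= split            # the min point becomes a leaf child
    return axis, split, left
        </preformat>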
      </sec>
      <sec id="sec-2-2">
        <title>2.2. kNN Classification Based on KD-Tree</title>
        <p>The authors of [5] talk about how the KD-tree is a special storage structure for data and how it represents the training data efficiently. They propose a kNN-KD-tree classification algorithm utilizing the advantages offered by the kNN and KD-tree algorithms. They apply the proposed method on eleven datasets and show that their kNN-KD-tree algorithm reduces the time complexity while significantly improving the search performance.</p>
        <p>Another nearest neighbor search algorithm, which uses random projection and voting, is discussed in [6].</p>
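        <p>As a minimal illustration of this kNN-on-KD-tree scheme (not the exact code of [5]), the sketch below builds a SciPy KD-Tree index [9] in the offline phase and classifies a query by majority vote in the online phase; the function name and parameters are our own.</p>
        <preformat>
import numpy as np
from collections import Counter
from scipy.spatial import KDTree   # SciPy's KD-Tree implementation [9]

def knn_classify(train_X, train_y, query_point, k=10):
    """Offline phase: build the index; online phase: query the k nearest
    neighbors and return the majority class among them."""
    index = KDTree(train_X)                 # offline: build the tree index once
    _, idx = index.query(query_point, k=k)  # online: indices of k nearest neighbors
    labels = np.asarray(train_y)[np.atleast_1d(idx)]
    return Counter(labels).most_common(1)[0][0]  # majority vote
        </preformat>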
      <sec id="sec-2-1">
        <title>2.3. SVM Based reduced NN</title>
      </sec>
      <sec id="sec-2-2">
        <title>Classification</title>
        <p>that the proposed algorithm outperformed other
existing practices.</p>
        <p>The authors in [3] propose a novel SVM based instance
selection method. Here Support Vector Machine (SVM) 3. Construction of Coreset
tion accuracy with a small number of training instances, where it traces down to one of the leaf nodes in the tree
is used to form an instance space for instances of a
particular pattern classification problem. A
wrapperbased classification performance validation technique
is used to find the best hyperparameters which
identify the support vector. Then they identify and select
informative support vectors (instances) lying on the
margin hyper-plane in the instance space. Thus
deriving a reduced training set for reduced nearest neighbor
classification. This reduced training set is then used to
classify new instances.</p>
        <sec id="sec-2-2-1">
          <title>They demonstrate the performance of the proposed</title>
          <p>instance selection method on some datasets. Although
their method maintained or even increased
classificaall the datasets chosen are very small in nature. The
highest number of instances in a dataset being 699 in</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>Breastw dataset. We focus more on datasets with very large number of instances.</title>
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Lightweight Coresets</title>
        <sec id="sec-2-3-1">
          <title>Coresets are compact representations of original data</title>
          <p>sets. They provide provably competitive results with
models trained on the full data set. As such, coresets
are successfully used to scale up diferent clustering
models to very large data sets.</p>
          <p>There are several existing Coreset building
methodologies[7]. O. Bachem, and M. Lucic[8]
proposed a notion of lightweight coresets that allowed for
both multiplicative and additive errors. They provided
an algorithm to construct lightweight coresets for
kmeans clustering and soft and hard Bregman
clustering. Their algorithm is substantially faster than the
allel in the sense that the data can be divided between
several threads for parallel computation. Also as the
name suggests, it results in smaller coresets.</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>Lightweight coreset algorithm uses variance as a</title>
          <p>means to create a probability distribution for the points
in data set. Points farther from the mean have higher
probability when compared to points which are closer
to the mean. A small subset of points can then be
sampled based on the derived probability distribution.</p>
        </sec>
        <sec id="sec-2-3-3">
          <title>As points are sampled based on the variance in data,</title>
          <p>it helps keep classification or clustering properties of
datasets. They have showed that their approach
naturally generalizes to statistical  -means clustering. They
also performed extensive experiments, demonstrating
other existing coreset construction algorithms. It’s par- from the actual dataset. This algorithm takes as
in</p>
        </sec>
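        <p>In symbols, the distribution sketched above, as given in [8] and reconstructed here from the prose, assigns each point x of the dataset X the sampling probability</p>
        <disp-formula>
          <tex-math>q(x) = \frac{1}{2|X|} + \frac{1}{2} \cdot \frac{d(x, \mu)^2}{\sum_{x' \in X} d(x', \mu)^2}</tex-math>
        </disp-formula>
        <p>where μ is the mean of X and d is the Euclidean distance: the uniform term keeps every point reachable, while the distance term favors points far from the mean.</p>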
      </sec>
    </sec>
    <sec id="sec-3">
<title>3. Construction of Coreset KD-Tree (CKD-Tree)</title>
      <sec id="sec-3-1">
        <title>Though KD-Tree for classification is a pretty fast algo</title>
        <p>rithm in itself, it may not be so for very large datasets.</p>
      </sec>
      <sec id="sec-3-2">
        <title>To improve on the already fast KD-Tree classification</title>
        <p>algorithm, and to create an even faster version of
KD</p>
      </sec>
      <sec id="sec-3-3">
        <title>Tree we use similar approach as in the case of cluster</title>
        <p>ing algorithms, i.e. make use of Coresets. We first use
a Coreset algorithm to create a representative set of
points from the original data set. This representative
set is then fed to the KD-Tree algorithm to build a tree
index (ofline phase) based on the representative set.</p>
      </sec>
      <sec id="sec-3-4">
        <title>When a query point arrives, we feed it into the tree, index. At this point any suitable search method can be used to find nearest neighbors to the query point in the leaf node.</title>
      </sec>
      <sec id="sec-3-5">
        <title>Algorithm 1 Lightweight Coreset Construction</title>
        <p>Require: Set of data points X, coreset size m
1:  ←− mean of X
2: for  ∈</p>
        <p>do
3:
5: 
4: end for
 ( ) ←− 12 | 1 | + 21 ∑ ′∈  (, )2</p>
        <p>(, )2
each point  has weight
probability  ( )
1
. ( )</p>
      </sec>
      <sec id="sec-3-6">
        <title>6: return lightweight coreset C</title>
        <p>←− sample m weighted points from</p>
        <p>where
and is sampled with</p>
        <p>We use Algorithm 1, Lightweight Coreset
Construction [8] (LWCS) to create the set of representative points
put a dataset</p>
        <p>and the coreset size  , i.e. the
number of representative points in the coreset. It creates
a probabilty distribution based on a point’s distance
from the mean, w.r.t. the total of all such distances.</p>
        <p>Distance metric used here is euclidian distance. Once
every point has a probability assigned to it, we sample

points with weight
.
1
( )</p>
        <p>and probabiltiy  ( ).</p>
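        <p>The following is a minimal NumPy sketch of Algorithm 1, assuming the dataset fits in memory as an (n × d) array; the function name, the random generator argument and the returned weight array are our own conventions.</p>
        <preformat>
import numpy as np

def lightweight_coreset(X, m, rng=None):
    """Lightweight Coreset Construction (Algorithm 1, after [8])."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    mu = X.mean(axis=0)                               # line 1: mean of X
    sq = ((X - mu) ** 2).sum(axis=1)                  # squared Euclidean d(x, mu)^2
    q = 0.5 / n + 0.5 * sq / sq.sum()                 # lines 2-4: probability q(x)
    picks = rng.choice(n, size=m, replace=True, p=q)  # line 5: sample with prob. q(x)
    weights = 1.0 / (m * q[picks])                    # weight 1/(m * q(x)) per sample
    return X[picks], weights                          # line 6: lightweight coreset C
        </preformat>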
      </sec>
      <sec id="sec-3-6">
        <title>Algorithm 2 CKD-Tree Algorithm For Classification</title>
        <p>Algorithm 2, the CKD-Tree Algorithm for Classification, uses Algorithm 1, Lightweight Coreset Construction (LWCS), to process the original large dataset X and get a compact version of it. This coreset is then used to build the index at line 2 of the algorithm. To build the tree index we use the sliding-midpoint [4] technique. The index can then be queried with a query point; a query requires you to specify k, i.e. the number of nearest neighbors required, along with the point to query with, i.e. queryPoint. The flow diagram of the entire work is presented in Figure 3.</p>
        <p>In our specific use case, we use the nearest neighbors to classify the query point into a class. This can be done easily based on the majority class among the nearest neighbors returned.</p>
        <p>Require: large dataset X, coreset size m</p>
        <p>1: repData ← LightweightCoreset(X, m)</p>
        <p>2: index ← BuildKDTree(repData)</p>
        <p>3: dist, idx ← index.query(queryPoint, k)</p>
        <p>4: for i ∈ idx do</p>
        <p>5: add the point at index i of repData to the Nearest Neighbor Points</p>
        <p>6: end for</p>
        <p>7: queryPointClass ← Majority class of the Nearest Neighbor Points</p>
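        <p>Putting Algorithm 1 and Algorithm 2 together, the sketch below (function and variable names are ours) classifies a query point by building a SciPy KD-Tree [9] on a lightweight coreset; SciPy's tree stands in here for the sliding-midpoint builder of [4], and the labels are carried alongside the sampled points so that the majority vote of line 7 can be taken.</p>
        <preformat>
import numpy as np
from collections import Counter
from scipy.spatial import KDTree

def ckd_tree_classify(X, y, query_point, m=2000, k=10, rng=None):
    """CKD-Tree classification (Algorithm 2): coreset, index, query, vote."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    mu = X.mean(axis=0)
    sq = ((X - mu) ** 2).sum(axis=1)
    q = 0.5 / n + 0.5 * sq / sq.sum()
    picks = rng.choice(n, size=m, replace=True, p=q)  # line 1: repData via LWCS(X, m)
    rep_X, rep_y = X[picks], np.asarray(y)[picks]     # keep the coreset's labels
    index = KDTree(rep_X)                             # line 2: offline phase, build index
    _, idx = index.query(query_point, k=k)            # line 3: online k-NN query
    neighbors = rep_y[np.atleast_1d(idx)]             # lines 4-6: nearest neighbor points
    return Counter(neighbors).most_common(1)[0][0]    # line 7: majority class
        </preformat>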
      </sec>
    </sec>
    <sec id="sec-3a">
      <title>4. Experiments and Results</title>
      <p>We implement the CKD-Tree using the above methodology and compare it against the KD-Tree [4][9] to see the performance difference it can provide, and at what cost. All of the datasets [<xref ref-type="bibr" rid="ref10">10</xref>] in Table 1 have two target classes. While the datasets bio_train and MiniBooNE Particle are both very large, HTRU2 and spambase are relatively very small. This helps in showing the relative performance of the CKD-Tree algorithm on different types of datasets. The dataset default of credit card clients is a more balanced dataset in terms of sample size and dimensionality.</p>
      <p>We kept 1000 samples from each dataset as a test set for testing the models. These samples are used to check the accuracy of the predictions made by the algorithms. While testing the VP-Tree and R-Tree, we considered test sample sizes of 10, 50, 100, 200 and 500, and later calculated the average times over them. While building the tree index for the KD-Tree, the leaf size, i.e. the number of points in each leaf node, was kept the same throughout; the index is then queried with k, i.e. the required number of nearest neighbors. We measure the performance based on three factors: the accuracy of the results, the average time (in seconds) taken in building the tree index, and the average time (in seconds) taken in answering the query. Each of these factors is compared and tabulated separately for all the data structures that were used and also for the proposed work.</p>
      <p>Table 2 shows the results of CKD-Tree for k = 10. We use 3 different coreset sizes, m = 1000, m = 2000 and m = 5000, and find the average of all of them (Avg. 1). For the spambase dataset the coreset of size m = 5000 is not generated, as the data size itself is only 4601. Consider the bio_train dataset: in the table, the average value of 'Indexing time' is calculated by adding the Indexing times of m = 1000, m = 2000 and m = 5000 and finally dividing by 3. The same is applied for the Querying time and Accuracy of Avg. 1.</p>
      <p>Table 3 shows the results of CKD-Tree for k = 50. We use the same 3 coreset sizes m = 1000, m = 2000 and m = 5000 and find the average of all of them (Avg. 2), computed in the same way as Avg. 1.</p>
      <p>Table 4 presents the average of Avg. 1 and Avg. 2. This average is called the 'Overall Avg'. The Indexing time of the 'Overall Avg' is obtained by averaging the 'Indexing time' of Avg. 1 and Avg. 2. The same is applied for the 'Querying time' and 'Accuracy'.</p>
      <p>Table 5, given below, is the final comparison table. It presents the comparison among the R-Tree, VP-Tree, KD-Tree and the proposed work. It is observed from Table 5 that the proposed work outperforms all the data structures in Querying time. Considering the Indexing time, the proposed work also performed better than the VP-Tree and R-Tree. The accuracy of the proposed work is approximately close to that of the other data structures.</p>
      <p>Figures 4 and 5, given below, show the comparison of Indexing time and Querying time respectively among the R-Tree, VP-Tree, KD-Tree and the proposed work.</p>
      <p>Among the data structures used for comparison, the KD-Tree is considered to be the best, so we concentrated mostly on the KD-Tree. In Table 6 we present the indexing time comparison of the KD-Tree and the proposed work. As the size of the data increases, the performance of the proposed work improves, and at a point it even starts performing better than the KD-Tree. So, for large datasets the proposed work is preferable.</p>
      <p>Table 5: Comparison among R-Tree, VP-Tree, KD-Tree and Proposed Work, reporting Indexing time, Querying time and Accuracy for the spambase, bio_train, HTRU2, default of credit card clients and MiniBooNE datasets.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>The above tables show that, for at least one of the tested settings, each dataset showed competitive or in some cases better accuracy (default of credit card clients and HTRU2) when used with coresets. In the case of larger datasets such as bio_train and MiniBooNE, the coreset size is very small compared to the original dataset size, but the coresets still manage to provide almost the same accuracy as the original dataset. Also, a KD-Tree built on the coresets of these datasets sees a significant speed boost in the offline (indexing) and online (query) phases.</p>
      <sec id="sec-4-1">
        <title>We can also notice that as the dataset size starts to</title>
        <p>decrease, the gap in indexing and query speeds starts
to become smaller and smaller. For smaller datasets
(spambase and HTRU2), this might even lead to higher
query or indexing times.</p>
      </sec>
      <sec id="sec-4-2">
        <title>This shows that CKD-Tree is a very eficient algorithm for nearest neighbor based classification in large datasets.</title>
        <p>Accuracy
R-Tree
0.025816591</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Future Work</title>
<p>The CKD-Tree gives good results on the datasets experimented with. The probability distribution used here is based on the variance of the data. Consequently, this approach might not perform well on noisy or locality sensitive data. In such cases, a better probability distribution approach for data summarization could be based on the locality of the data. Locality Sensitive Hashing (LSH) functions could provide a more accurate form of data summarization. For image data, a vector quantization based approach might help in data summarization.</p>
      <sec id="sec-5-1">
        <title>In the online phase, implementation of a more ro</title>
        <p>bust and faster search technique could also be fruitful.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Randomizing the algorithm or having multiple tree indexes to increase the accuracy is another option.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>International Conference on Big Data</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>doi:10</source>
          .1109/BigData.
          <year>2016</year>
          .
          <volume>7840682</volume>
          . [1]
          <string-name>
            <surname>Yianilos</surname>
          </string-name>
          ,
          <article-title>Data structures</article-title>
          and algorithms for [7]
          <string-name>
            <given-names>O. B.</given-names>
            <surname>Mario Lucic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krause</surname>
          </string-name>
          , Strong coresets for
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>crete algorithms</surname>
          </string-name>
          (
          <year>1993</year>
          ). doi:
          <volume>10</volume>
          .1145/313559. (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          313789. [8]
          <string-name>
            <surname>A. K. O. Bachem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Lucic</surname>
          </string-name>
          ,
          <string-name>
            <surname>Scalable</surname>
            k-means clus[2]
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Guttman</surname>
          </string-name>
          , R-trees:
          <article-title>A dynamic index structure tering via lightweight coresets</article-title>
          ., ACM SIGKDD
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>for spatial searching</article-title>
          ,
          <source>Proceedings of the 1984 International Conference on Knowledge Discov-</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>ACM SIGMOD international conference on Man- ery and Data Mining (KDD)</source>
          (
          <year>2018</year>
          ). doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>agement of data - SIGMOD '84</source>
          (
          <year>1984</year>
          ). doi:
          <volume>10</volume>
          . 1145/3219819.3219973.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <volume>1145</volume>
          /602264.602266. [9] scipy.org, Spatial kdtree class,
          <year>2020</year>
          . URL: [3]
          <string-name>
            <surname>C.-C. Huang</surname>
            , H.-
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>A novel svm-based https://docs</article-title>
          .scipy.org/doc/scipy-0.
          <fpage>14</fpage>
          .0/
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <article-title>reduced nn classification method</article-title>
          ., 11th Interna- reference/generated/scipy.spatial.KDTree.html.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>tional Conference on Computational Intelligence</source>
          [10] c. ics.uci.edu, Datasets used,
          <year>2007</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          and
          <string-name>
            <surname>Security</surname>
          </string-name>
          (
          <year>2015</year>
          ). doi:
          <volume>10</volume>
          .1109/CIS.
          <year>2015</year>
          . https://archive.ics.uci.edu/ml/datasets.php,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          23. osmot.cs.cornell.edu./kddcup/datasets.html. [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Maneewongvatana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Mount</surname>
          </string-name>
          , It's okay
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>to be skinny, if your friends are fat</article-title>
          .,
          <source>4th An-</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>try</surname>
          </string-name>
          (
          <year>1999</year>
          ).
          <source>doi:10.1.1.39</source>
          .8380. [5]
          <string-name>
            <given-names>C. X. H. Z.</given-names>
            <surname>Wenfeng</surname>
          </string-name>
          <string-name>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Daiwei</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>tization (</surname>
            <given-names>IICSPI</given-names>
          </string-name>
          ) (
          <year>2018</year>
          ). doi:
          <volume>10</volume>
          .1109/IICSPI.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <year>2018</year>
          .
          <volume>8690508</volume>
          . [6]
          <string-name>
            <given-names>S. T. E. J. R. T. L. W. J. C. V.</given-names>
            <surname>Hyvönen</surname>
          </string-name>
          , T. Pitkänen,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>