=Paper=
{{Paper
|id=Vol-2786/Paper28
|storemode=property
|title=CKD-Tree: An Improved KD-Tree Construction Algorithm
|pdfUrl=https://ceur-ws.org/Vol-2786/Paper28.pdf
|volume=Vol-2786
|authors=Y Narasimhulu,Ashok Suthar,Raghunadh Pasunuri,V China Venkaiah
|dblpUrl=https://dblp.org/rec/conf/isic2/NarasimhuluSPV21
}}
==CKD-Tree: An Improved KD-Tree Construction Algorithm==
Y Narasimhulu<sup>a</sup>, Ashok Suthar<sup>a</sup>, Raghunadh Pasunuri<sup>b</sup> and V China Venkaiah<sup>a</sup>

<sup>a</sup> SCIS, University of Hyderabad, Prof. CR Rao Road, Gachibowli, Hyderabad, 500046, India
<sup>b</sup> CSE, Malla Reddy Engineering College (Autonomous), Maisammaguda (H), Gundlapochampally Village, Medchal Mandal, Medchal-Malkajgiri District, Telangana State, 500100, India

Contact: narasimedu@gmail.com (Y. Narasimhulu), ashok.suthar.sce@gmail.com (A. Suthar), raghupasunuri@gmail.com (R. Pasunuri), venkaiah@hotmail.com (V.C. Venkaiah)

ISIC 2021: International Semantic Intelligence Conference, February 25–27, 2021, New Delhi, India. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

'''Abstract.''' Data structures such as the VP-Tree, R-Tree and KD-Tree build an index of all the available data in an offline phase and use that index to answer nearest neighbor queries or to classify an input query. We use a lightweight coreset algorithm to reduce the amount of data used to build the tree index, resulting in a faster index building time. We improve on existing nearest neighbor based classification techniques and compare our classification method against widely accepted, state-of-the-art data structures such as the VP-Tree, R-Tree and KD-Tree. In terms of speed, the proposed method outperforms the compared data structures as the size of the data increases.

'''Keywords.''' KD-Tree, coresets, nearest neighbor, classification.

===1. Introduction===
The k-Nearest Neighbor (kNN) problem is the problem of finding the k points or samples in a dataset that are closest to a given query point. The nearest neighbor algorithm is used in several machine learning areas, such as classification and regression, and is also the most time-consuming part of these applications. In use cases such as recommendation systems, computer vision and robotics, fast response times are critical, and brute-force approaches such as linear search are not feasible. Hence there are several approaches to the nearest neighbor problem based on hashing, graphs or space-partitioning trees. Space-partitioning methods are generally more efficient because they have fewer tunable parameters. One such method is the KD-Tree, a space-partitioning algorithm which divides the space recursively with hyper-planes chosen by a splitting rule, reducing the search space by almost half at every level. Another space-partitioning structure is the Vantage Point Tree (VP-Tree) [1], which divides the data in a metric space by selecting a position in the space called the vantage point and partitioning the data into two parts: the points that are close to the vantage point and the points that are not. The division continues until the resulting sets are small, and the final tree is constructed such that neighbors in the tree are also neighbors in the original space. The R-Tree [2] is another data structure, most commonly used to store spatial objects such as the locations of gas stations and restaurants or the outlines of agricultural land.

In this paper we consider kNN for classification, where the nearest neighbors of a query point in the dataset are used to classify the query point. Nearest neighbor is in essence a lazy learning algorithm: it memorizes the whole training dataset in order to provide the nearest neighbors of an incoming query point. Consequently, although these algorithms provide very efficient solutions to the nearest neighbor problem, they can run into trouble when the dataset becomes too large, which is common given the magnitude of data available today. In critical systems where time is of the essence, losing even a few seconds while processing all that data can cause issues. The author in [3] uses an SVM to tackle a similar problem by reducing the size of the data on which the nearest neighbor algorithm runs. We use coresets for a similar effect, but on very large datasets.

The concept of coresets follows a data summarization approach. Coresets are small subsets of the original data that are used to scale clustering problems to massive datasets. Models trained on coresets provide competitive results compared with models trained on the full original dataset, so they can be very useful for speeding up such models while still retaining theoretical guarantees up to a level. Coresets are often used to speed up clustering algorithms even further: first construct a coreset, usually in linear time, and then run the clustering algorithm on the coreset. As the coreset is very small compared to the actual data, this can provide a significant speedup.

We use a state-of-the-art lightweight coreset construction algorithm to reduce the time taken to solve the nearest neighbor problem with the KD-Tree space-partitioning algorithm, and we use the result of the nearest neighbor query to classify the input query point based on the nearest neighbors found.
===2. Related Work===
As noted above, there are a number of approaches to the nearest neighbor problem based on hashing, graphs, space partitioning, etc. We focus mainly on the KD-Tree, a widely accepted and widely used space-partitioning technique that is faster for nearest neighbor and classification problems than the VP-Tree and R-Tree.

Figure 1: Example of the sliding-midpoint rule.

Figure 2: Examples of splitting rules on a common data set.

====2.1. KD-Tree (K-Dimensional Tree)====
The KD-Tree is a space-partitioning algorithm. Its workflow can be divided into two parts, called the offline phase and the online phase. It builds the index (the tree) in the offline phase so that it can answer queries later, in the online phase. During the offline phase it subdivides space into rectangular cells through the recursive application of a splitting rule. In the standard split, the splitting dimension is the dimension of maximum spread in the data, whereas in the midpoint split the splitting hyper-plane bisects the longest side of the cell and passes through the center of the cell. The most widely used KD-Tree implementation today is based on the paper by Songrit Maneewongvatana and David M. Mount [4].

They consider a variant of the midpoint split called the sliding-midpoint rule, which overcomes the problem of datasets in which points are clustered together. An example of the sliding-midpoint rule is shown in Figure 1, and Figure 2 depicts the behavior of other splitting rules on a common dataset. In this approach the data is first split at the midpoint, using a hyper-plane that passes through the center of the cell and divides the cell's longest side into two parts. They then check whether data points exist on both sides of the splitting plane. If so, the splitting plane is not moved. However, if one side of the splitting plane is empty, i.e. contains no point, the splitting plane is slid towards the data points until it encounters at least one of them. That point then becomes a child (a leaf cell containing a single point), while the algorithm continues recursively on the remaining points.

Once the tree has been built and a query arrives, we trace the query point down the tree by comparing its value in the cut dimension with the cut value at each level. Continuing this process leads to a leaf node; the leaf node that the query point reaches in the end contains its nearest neighbor. A query point can have more than one nearest neighbor.
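To make the sliding-midpoint rule concrete, here is a minimal Python sketch of a single split step. It is our own illustration rather than the implementation of [4] or of any particular library; the function name, the representation of a cell by its lower and upper corners, and the handling of points lying exactly on the cut are assumptions.

<pre>
import numpy as np

def sliding_midpoint_split(points, lo, hi):
    """One sliding-midpoint split of the cell bounded by corners lo and hi.

    The cut bisects the cell's longest side; if one side of the cut would be
    empty, the plane slides onto the nearest data point so that this point
    ends up alone on one side, as described in Section 2.1.
    """
    dim = int(np.argmax(hi - lo))       # split the cell's longest side
    cut = 0.5 * (lo[dim] + hi[dim])     # midpoint of that side
    vals = points[:, dim]
    left_mask = vals <= cut
    if left_mask.all():                 # right side empty: slide the plane left
        cut = vals.max()
        left_mask = vals < cut          # point(s) at the cut form the right child
    elif not left_mask.any():           # left side empty: slide the plane right
        cut = vals.min()
        left_mask = vals <= cut         # point(s) at the cut form the left child
    return dim, cut, points[left_mask], points[~left_mask]
</pre>

A full KD-Tree construction would apply this split recursively, replacing lo[dim] or hi[dim] with the returned cut value for each child cell.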
====2.2. kNN Classification Based on the KD-Tree====
The authors of [5] discuss how the KD-Tree is a special storage structure for data and how it represents the training data efficiently. They propose a kNN-KD-tree classification algorithm that combines the advantages offered by the kNN and KD-Tree algorithms. They apply the proposed method to eleven datasets and show that their kNN-KD-tree algorithm reduces time complexity while significantly improving search performance. Another nearest neighbor search algorithm, based on sparse random projections and voting, is discussed in [6].

====2.3. SVM-Based Reduced NN Classification====
The authors of [3] propose a novel SVM-based instance selection method. A Support Vector Machine (SVM) is used to form an instance space for the instances of a particular pattern classification problem. A wrapper-based classification performance validation technique is used to find the best hyperparameters, which identify the support vectors. They then identify and select the informative support vectors (instances) lying on the margin hyper-plane in the instance space, thereby deriving a reduced training set for reduced nearest neighbor classification. This reduced training set is then used to classify new instances.

They demonstrate the performance of the proposed instance selection method on several datasets and show that it outperforms other existing practices. However, although their method maintained or even increased classification accuracy with a small number of training instances, all the datasets chosen are very small; the largest, Breastw, has only 699 instances. We focus instead on datasets with a very large number of instances.

====2.4. Lightweight Coresets====
Coresets are compact representations of the original dataset that provide provably competitive results compared with models trained on the full data. As such, coresets have been used successfully to scale various clustering models to very large datasets.

Several coreset construction methodologies exist [7]. Bachem, Lucic and Krause [8] proposed a notion of lightweight coresets that allows for both multiplicative and additive errors. They provided an algorithm to construct lightweight coresets for k-means clustering and for soft and hard Bregman clustering. Their algorithm is substantially faster than other existing coreset construction algorithms; it is parallel in the sense that the data can be divided between several threads, and, as the name suggests, it produces smaller coresets.

The lightweight coreset algorithm uses variance to define a probability distribution over the points in the dataset: points farther from the mean receive a higher probability than points closer to the mean, and a small subset of points is then sampled from this distribution. Because points are sampled according to the variance in the data, the classification or clustering properties of the dataset are largely preserved. The authors show that their approach naturally generalizes to statistical k-means clustering, and their extensive experiments demonstrate that it outperforms existing practices.
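Concretely, the sampling distribution used by the lightweight coreset construction of [8], restated as Algorithm 1 in the next section, mixes a uniform term with a variance-based term:

<math>q(x) = \frac{1}{2}\,\frac{1}{|X|} + \frac{1}{2}\,\frac{d(x,\mu)^2}{\sum_{x' \in X} d(x',\mu)^2},</math>

where μ is the mean of the dataset X and d(·,·) denotes the Euclidean distance; each sampled point x is assigned the weight 1/(m·q(x)), exactly as in Algorithm 1 below.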
===3. Construction of the Coreset KD-Tree (CKD-Tree)===
Though the KD-Tree is a fairly fast classification algorithm in itself, it may not be fast enough for very large datasets. To improve on the already fast KD-Tree classification algorithm and obtain an even faster variant, we use an approach similar to the one used for clustering algorithms, i.e. we make use of coresets. We first use a coreset algorithm to create a representative set of points from the original dataset. This representative set is then fed to the KD-Tree algorithm to build a tree index (the offline phase). When a query point arrives, we feed it into the tree, where it traces down to one of the leaf nodes of the index. At this point any suitable search method can be used to find the nearest neighbors of the query point within the leaf node.

Algorithm 1: Lightweight Coreset Construction
 Require: set of data points X, coreset size m
 1: μ ← mean of X
 2: for x ∈ X do
 3:   q(x) ← (1/2) · 1/|X| + (1/2) · d(x, μ)² / Σ_{x'∈X} d(x', μ)²
 4: end for
 5: C ← sample m weighted points from X, where each point x has weight 1/(m·q(x)) and is sampled with probability q(x)
 6: return lightweight coreset C

We use Algorithm 1, Lightweight Coreset Construction (LWCS) [8], to create the set of representative points from the actual dataset. The algorithm takes as input a dataset X and the coreset size m, i.e. the number of representative points in the coreset. It creates a probability distribution based on each point's squared Euclidean distance from the mean, relative to the total of all such distances. Once every point has a probability assigned to it, we sample m points, each with weight 1/(m·q(x)) and sampling probability q(x).
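A direct NumPy rendering of Algorithm 1 might look as follows. This is our own sketch: the function name, the returned index array (kept so that class labels can later be carried over to the coreset points) and the seeding of the random generator are assumptions rather than part of the original pseudocode.

<pre>
import numpy as np

def lightweight_coreset(X, m, seed=None):
    """Lightweight coreset construction (Algorithm 1, following [8]).

    Returns the m sampled points, their weights 1/(m*q(x)), and the indices
    of the sampled rows of X (handy for looking up class labels later).
    """
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)                                  # line 1: mean of X
    dist_sq = ((X - mu) ** 2).sum(axis=1)                # squared Euclidean distances
    q = 0.5 / len(X) + 0.5 * dist_sq / dist_sq.sum()     # line 3: sampling distribution
    idx = rng.choice(len(X), size=m, p=q)                # line 5: sample m points (with replacement)
    weights = 1.0 / (m * q[idx])
    return X[idx], weights, idx
</pre>

For instance, rep_data, w, idx = lightweight_coreset(X, m=2000) yields a 2000-point representative set of the kind fed to the KD-Tree construction below.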
Algorithm 2, the CKD-Tree algorithm for classification, uses Algorithm 1 (LWCS) to obtain a compact version of the original large dataset X. This coreset, repData, is then used to build the tree index at line 2 of the algorithm; to build the tree index we use the sliding-midpoint technique [4]. The tree index can then be queried with a query point. A query requires the number of nearest neighbors k to be specified along with the point to query with, queryPoint. The flow diagram of the entire work is presented in Figure 3.

In our specific use case, we use the nearest neighbors to classify the query point into a class. This is done simply by taking the majority class among the nearest neighbors returned.

Algorithm 2: CKD-Tree Algorithm for Classification
 Require: large dataset X, coreset size m
 1: repData ← LightweightCoresetAlgorithm(X, m)
 2: tree ← KDTree(repData)
 3: dist, NNIndices ← tree.query(queryPoint, k = numberOfNeighbors)
 4: for index ∈ NNIndices do
 5:   print the point at index in repData, i.e. a nearest point
 6: end for
 7: queryPointClass ← majority class of the nearest neighbor points

Figure 3: Flow diagram of the CKD-Tree algorithm for classification.
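Putting the two algorithms together, the following sketch shows one plausible end-to-end CKD-Tree classifier built on scipy.spatial.KDTree, the SciPy implementation cited in [9]. It assumes the lightweight_coreset helper sketched above is in scope; the function name, the label handling via the returned indices, and the choice leafsize = k (mirroring the experimental setup in Section 4) are our assumptions, not the authors' code.

<pre>
import numpy as np
from collections import Counter
from scipy.spatial import KDTree  # the SciPy KD-Tree referenced in [9]

def ckd_tree_classify(X, y, query_point, m=2000, k=10):
    """CKD-Tree classification sketch (Algorithm 2): build a KD-Tree on a
    lightweight coreset of X and label query_point with the majority class
    of its k nearest coreset points."""
    rep_data, _, idx = lightweight_coreset(X, m)    # line 1: coreset of X
    tree = KDTree(rep_data, leafsize=k)             # line 2: build the tree index
    _, nn = tree.query(query_point, k=k)            # line 3: k nearest neighbours
    nn_labels = y[idx][np.atleast_1d(nn)]           # classes of those coreset points
    return Counter(nn_labels).most_common(1)[0][0]  # line 7: majority class
</pre>

The accuracy of such a classifier then depends mainly on the coreset size m and on k, which are exactly the parameters varied in the experiments below.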
===4. Experiments and Results===
We implement the CKD-Tree using the above methodology and compare it against the KD-Tree [4][9] to see the performance difference it can provide and at what cost. All of the datasets [10] in Table 1 have two target classes. While bio_train and MiniBooNE Particle are both very large datasets, HTRU2 and spambase are relatively small; this helps show the relative performance of the CKD-Tree algorithm on different types of datasets. The default of credit card clients dataset is more balanced in terms of sample size and dimensionality.

{| class="wikitable"
|+ Table 1: Datasets used
! Dataset !! Number of instances !! Dimensions/Attributes
|-
| bio_train || 145,751 || 74
|-
| MiniBooNE Particle || 130,065 || 50
|-
| default of credit card clients || 30,000 || 24
|-
| HTRU2 || 17,898 || 9
|-
| spambase || 4,601 || 57
|}

We kept 1000 samples from each dataset as a test set; these samples are used to check the accuracy of the predictions made by the algorithm. While testing the VP-Tree and R-Tree, we considered test sample sizes of 10, 50, 100, 200 and 500 and then calculated the average times over them. While building the tree index for the KD-Tree, leafsize was kept equal to the number of nearest neighbors queried (k), i.e. leafsize = k, where leafsize is the number of points in each leaf node of the tree index.

We measure performance using three factors: the accuracy of the results, the average time (in seconds) taken to build the tree index, and the average time (in seconds) taken to answer a query. Each of these factors is compared and tabulated separately for all the data structures used and for the proposed work.

Table 2 shows the results of the CKD-Tree for k = 10. We use three different coreset sizes, m = 1000, m = 2000 and m = 5000, and report their average (Avg. 1). For the spambase dataset the coreset of size m = 5000 is not generated, as the dataset itself has only 4601 instances. For example, for the bio_train dataset the indexing time of Avg. 1 is obtained by adding the indexing times for m = 1000, m = 2000 and m = 5000 and dividing by 3; the same applies to the querying time and accuracy.

{| class="wikitable"
|+ Table 2: Proposed work compared across coreset sizes for k = 10 (times in seconds, accuracy in %)
! Dataset !! m !! Indexing time !! Querying time !! Accuracy
|-
| spambase || 1000 || 0.131846189 || 0.002161955 || 75.1
|-
| spambase || 2000 || 0.14062953 || 0.002655576 || 75
|-
| spambase || 5000 || N/A || N/A || N/A
|-
| spambase || Avg. 1 || 0.13623786 || 0.002408766 || 75.05
|-
| bio_train || 1000 || 9.025853157 || 0.003039943 || 97.6
|-
| bio_train || 2000 || 9.060353756 || 0.004280325 || 97.8
|-
| bio_train || 5000 || 9.396525383 || 0.006560953 || 98.4
|-
| bio_train || Avg. 1 || 9.160910765 || 0.004627074 || 97.93333333
|-
| HTRU2 || 1000 || 0.393140554 || 0.002060544 || 98.9
|-
| HTRU2 || 2000 || 0.328070402 || 0.001718247 || 98.7
|-
| HTRU2 || 5000 || 0.312451839 || 0.002077646 || 98.7
|-
| HTRU2 || Avg. 1 || 0.344554265 || 0.001952145 || 98.76666667
|-
| credit card || 1000 || 0.738107204 || 0.002901038 || 73.8
|-
| credit card || 2000 || 0.593619585 || 0.002796253 || 74.2
|-
| credit card || 5000 || 0.7029984 || 0.003421038 || 75.1
|-
| credit card || Avg. 1 || 0.67824173 || 0.003039443 || 74.36666667
|-
| MiniBooNE || 1000 || 4.820586681 || 0.001834114 || 94.8
|-
| MiniBooNE || 2000 || 4.866789103 || 0.001765214 || 94.8
|-
| MiniBooNE || 5000 || 4.655207396 || 0.002046397 || 94.8
|-
| MiniBooNE || Avg. 1 || 4.78086106 || 0.001881908 || 94.8
|}

Table 3 shows the results of the CKD-Tree for k = 50. We again use the coreset sizes m = 1000, m = 2000 and m = 5000 and report their average (Avg. 2); for spambase, m = 5000 is again not generated. The averages are computed in the same way as for Avg. 1.

{| class="wikitable"
|+ Table 3: Proposed work compared across coreset sizes for k = 50 (times in seconds, accuracy in %)
! Dataset !! m !! Indexing time !! Querying time !! Accuracy
|-
| spambase || 1000 || 0.159596443 || 0.009064549 || 73.6
|-
| spambase || 2000 || 0.156191826 || 0.010310147 || 71.4
|-
| spambase || 5000 || N/A || N/A || N/A
|-
| spambase || Avg. 2 || 0.157894135 || 0.009687348 || 72.5
|-
| bio_train || 1000 || 9.68689847 || 0.018128316 || 97.6
|-
| bio_train || 2000 || 9.013507843 || 0.013137584 || 97.6
|-
| bio_train || 5000 || 9.372775555 || 0.015574522 || 97.6
|-
| bio_train || Avg. 2 || 9.357727289 || 0.015613474 || 97.6
|-
| HTRU2 || 1000 || 0.346904278 || 0.006985356 || 98.7
|-
| HTRU2 || 2000 || 0.312402248 || 0.007482661 || 98.6
|-
| HTRU2 || 5000 || 0.328086615 || 0.007748176 || 98.8
|-
| HTRU2 || Avg. 2 || 0.329131047 || 0.007405398 || 98.7
|-
| credit card || 1000 || 0.879361391 || 0.006123379 || 75.4
|-
| credit card || 2000 || 0.609235764 || 0.007201368 || 75.2
|-
| credit card || 5000 || 0.656133652 || 0.008404238 || 75.2
|-
| credit card || Avg. 2 || 0.714910269 || 0.007242995 || 75.26666667
|-
| MiniBooNE || 1000 || 4.948148489 || 0.011002789 || 94.8
|-
| MiniBooNE || 2000 || 4.764508009 || 0.008747934 || 94.8
|-
| MiniBooNE || 5000 || 4.717645407 || 0.00815424 || 94.8
|-
| MiniBooNE || Avg. 2 || 4.810100635 || 0.009301654 || 94.8
|}
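For completeness, this is one way the indexing time, querying time and accuracy reported in Tables 2 and 3 could be measured. It is only a sketch under our own assumptions (timing with time.perf_counter, and the lightweight_coreset helper from the earlier sketch), not the authors' benchmarking code.

<pre>
import time
import numpy as np
from collections import Counter
from scipy.spatial import KDTree

def evaluate(X_train, y_train, X_test, y_test, m, k):
    """Return (indexing time, mean querying time, accuracy in %) for one
    (coreset size m, neighbours k) configuration of the CKD-Tree."""
    t0 = time.perf_counter()
    rep_data, _, idx = lightweight_coreset(X_train, m)  # offline phase:
    tree = KDTree(rep_data, leafsize=k)                 # coreset + tree index
    indexing_time = time.perf_counter() - t0

    correct, query_times = 0, []
    for x, y_true in zip(X_test, y_test):               # online phase
        t0 = time.perf_counter()
        _, nn = tree.query(x, k=k)
        query_times.append(time.perf_counter() - t0)
        labels = y_train[idx][np.atleast_1d(nn)]
        correct += Counter(labels).most_common(1)[0][0] == y_true
    return indexing_time, float(np.mean(query_times)), 100.0 * correct / len(X_test)
</pre>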
Table 4 presents the average of Avg. 1 and Avg. 2, which we call the Overall Avg. Its indexing time is obtained by averaging the indexing times of Avg. 1 and Avg. 2, and the same applies to the querying time and accuracy.

{| class="wikitable"
|+ Table 4: Overall average of the proposed work (Overall Avg. = (Avg. 1 + Avg. 2)/2)
! Dataset !! Indexing time !! Querying time !! Accuracy
|-
| spambase || 0.147065997 || 0.006048057 || 73.775
|-
| bio_train || 9.259319027 || 0.010120274 || 97.76666667
|-
| HTRU2 || 0.336842656 || 0.004678772 || 98.73333333
|-
| credit card || 0.696575999 || 0.005141219 || 74.81666667
|-
| MiniBooNE || 4.795480847 || 0.005591781 || 94.8
|}

Table 5 is the final comparison table; it presents the comparison among the R-Tree, VP-Tree, KD-Tree and the proposed work. It can be observed from Table 5 that the proposed work outperforms all the other data structures in querying time. Considering indexing time, the proposed work also performs better than the VP-Tree and R-Tree, and its accuracy is close to that of the other data structures.

{| class="wikitable"
|+ Table 5: Comparison among R-Tree, VP-Tree, KD-Tree and the proposed work (times in seconds, accuracy in %)
! Dataset !! Method !! Indexing time !! Querying time !! Accuracy
|-
| spambase || R-Tree || 46.01833411 || 0.025816591 || 66.632
|-
| spambase || VP-Tree || 0.9678272 || 0.00872661 || 88.216
|-
| spambase || KD-Tree || 0.097132921 || 0.009621621 || 71.2
|-
| spambase || Proposed work || 0.147065997 || 0.006048057 || 73.775
|-
| bio_train || R-Tree || 300.3921245 || 0.674326682 || 98.6
|-
| bio_train || VP-Tree || 46.720815 || 0.027852184 || 98.6
|-
| bio_train || KD-Tree || 13.37806582 || 0.079558667 || 99.3
|-
| bio_train || Proposed work || 9.259319027 || 0.010120274 || 97.76666667
|-
| HTRU2 || R-Tree || 98.00138087 || 0.002142198 || 96.6
|-
| HTRU2 || VP-Tree || 4.2660978 || 0.011163483 || 93
|-
| HTRU2 || KD-Tree || 0.088246346 || 0.006940414 || 98.6
|-
| HTRU2 || Proposed work || 0.336842656 || 0.004678772 || 98.73333333
|-
| credit card || R-Tree || 151.047457 || 0.08385841 || 79.8
|-
| credit card || VP-Tree || 6.7166709 || 0.048709915 || 79.4
|-
| credit card || KD-Tree || 0.799064255 || 0.012309615 || 75
|-
| credit card || Proposed work || 0.696575999 || 0.005141219 || 74.81666667
|-
| MiniBooNE || R-Tree || 250.0596289 || 0.099793004 || 99.8
|-
| MiniBooNE || VP-Tree || 41.5471755 || 0.021343294 || 99.8
|-
| MiniBooNE || KD-Tree || 7.842401028 || 0.022687735 || 94.8
|-
| MiniBooNE || Proposed work || 4.795480847 || 0.005591781 || 94.8
|}

Figure 4: Indexing time comparison.

Figure 5: Querying time comparison.

Figures 4 and 5 show the comparison of indexing time and querying time, respectively, among the R-Tree, VP-Tree, KD-Tree and the proposed work. Among the data structures used for comparison the KD-Tree is considered the best, so we concentrate mostly on it. Table 6 presents the indexing time comparison between the KD-Tree and the proposed work. As the size of the data increases, the advantage of the proposed work grows, and beyond a certain point it even starts to build the index faster than the KD-Tree; for large datasets the proposed work therefore takes less time to create the index. The breakover point, at 30,000 instances (the credit card dataset), is shown in Figure 6.

{| class="wikitable"
|+ Table 6: Indexing time comparison between the KD-Tree and the proposed work (seconds)
! Dataset !! Dataset size !! KD-Tree indexing time !! Proposed work indexing time
|-
| spambase || 4601 || 0.097132921 || 0.147065997
|-
| HTRU2 || 17898 || 0.088246346 || 0.336842656
|-
| credit card || 30000 || 0.799064255 || 0.696575999
|-
| MiniBooNE || 130065 || 7.842401028 || 4.795480847
|-
| bio_train || 145751 || 13.37806582 || 9.259319027
|}

Figure 6: KD-Tree and proposed work comparison.

===5. Conclusion===
The above tables show that, for at least one value of m, each dataset achieved competitive, and in some cases better (default of credit card clients and HTRU2), accuracy when used with coresets. For the larger datasets, bio_train and MiniBooNE, the coreset size is very small compared to the original dataset size, yet they still provide almost the same accuracy as the original dataset. Moreover, a KD-Tree built on the coresets of these datasets sees a significant speed boost in both the offline (indexing) and online (query) phases. We can also see that as the dataset size decreases, the gap in indexing and query speeds becomes smaller; for smaller datasets (spambase and HTRU2) this may even lead to higher query or indexing times. This shows that the CKD-Tree is a very efficient algorithm for nearest neighbor based classification on large datasets.

===6. Future Work===
The CKD-Tree gives good results on the experimented datasets. The probability distribution used here is based on the variance of the data; consequently this approach might not perform well on noisy or locality-sensitive data. In such cases a better probability distribution for data summarization could be based on the locality of the data: Locality Sensitive Hashing (LSH) functions could provide a more accurate form of data summarization, and for image data a vector quantization based approach might help. In the online phase, the implementation of a more robust and faster search technique could also be fruitful. Randomizing the algorithm or maintaining multiple tree indexes to increase accuracy is another option.
===References===
[1] P. N. Yianilos, Data structures and algorithms for nearest neighbor search in general metric spaces, Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (1993). doi:10.1145/313559.313789.

[2] A. Guttman, R-trees: A dynamic index structure for spatial searching, Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD '84) (1984). doi:10.1145/602264.602266.

[3] C.-C. Huang, H.-Y. Chang, A novel SVM-based reduced NN classification method, 11th International Conference on Computational Intelligence and Security (2015). doi:10.1109/CIS.2015.23.

[4] S. Maneewongvatana, D. M. Mount, It's okay to be skinny, if your friends are fat, 4th Annual CGC Workshop on Computational Geometry (1999).

[5] W. Hou, D. Li, C. Xu, H. Zhang, T. Li, An advanced k nearest neighbor classification algorithm based on KD-tree, IEEE International Conference of Safety Produce Informatization (IICSPI) (2018). doi:10.1109/IICSPI.2018.8690508.

[6] V. Hyvönen, T. Pitkänen, S. Tasoulis, E. Jääsaari, R. Tuomainen, L. Wang, J. Corander, T. Roos, Fast nearest neighbor search through sparse random projections and voting, IEEE International Conference on Big Data (2016). doi:10.1109/BigData.2016.7840682.

[7] M. Lucic, O. Bachem, A. Krause, Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures, arXiv (2015).

[8] O. Bachem, M. Lucic, A. Krause, Scalable k-means clustering via lightweight coresets, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2018). doi:10.1145/3219819.3219973.

[9] SciPy, scipy.spatial.KDTree class, 2020. URL: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.KDTree.html.

[10] archive.ics.uci.edu, Datasets used, 2007. URL: https://archive.ics.uci.edu/ml/datasets.php, osmot.cs.cornell.edu/kddcup/datasets.html.