The Effectiveness of PCA in KNN, Gaussian Naive Bayes Classifier and SVM for Raisin Dataset

Agnieszka Polowczyk¹, Alicja Polowczyk¹

¹ Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice, Poland

Abstract
Supervised learning is one of the main types of machine learning, in which a model is trained on data consisting of features (input data) and labels, that is, target values. Using the training data, the parameters of the model are adjusted until the loss function reaches a low value or until high accuracy is obtained on the validation data. Before building a model, the data must be preprocessed; PCA is often used to reduce the number of dimensions of the data, and models whose data have been reduced with PCA often retain high accuracy. In this article we examine how the well-known classifiers K-Nearest Neighbors, Gaussian Naive Bayes and Support Vector Machines perform when combined with PCA. We also evaluate classifiers whose data were reduced to fewer dimensions by analyzing correlation tables, as well as models whose data contain the original number of features. We assess their effectiveness on the Raisin database and show the decision boundaries of the models constructed in our analysis.

Keywords
Machine learning, PCA, KNN, Gaussian Naive Bayes, SVM, classifiers

SYSYEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
ap307985@polsl.pl (A. Polowczyk); ap307986@polsl.pl (A. Polowczyk)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Supervised learning is used in many areas, such as classification [1, 2], regression [3], pattern recognition, natural language processing and image encryption [4, 5, 6]. Examples of problems that can be solved with supervised learning are classifying whether an email is spam or not, weather forecasting, and text classification, for instance deciding whether a review is positive or negative [7, 8]. A popular algorithm used during model training, for example for regression or the SVM classifier, is gradient descent, which minimizes the loss function by adjusting the model parameters in the direction of the decreasing gradient of the loss function [9, 10]. The goal of supervised learning is to achieve accuracy high enough to make correct predictions on unseen data. There are many interesting improvements of such models for application systems. In [11] it was shown how to use machine learning for imbalanced data inputs, while [12, 13] presented the positioning of technical systems for power electric models. There are also many applications for complex input data structures, e.g. [14] addressed compositions of graph-based input relations.

In classification problems we also distinguish models based on deep neural networks [15], whose architecture consists of weights, activation functions [16], a loss function and an optimizer. In the case of the KNN and Gaussian Naive Bayes classifiers there is no learning with weights. Using a KNN model on a large dataset can lead to high consumption of computing resources; in [17] a strategy was proposed that improves the efficiency of the KNN classifier on Big Data.

In this paper we compare three classifiers, KNN, Gaussian Naive Bayes and SVM, each built on three variants of the data:
• a model that uses PCA to reduce the dimensionality of the data,
• a model that uses two features selected by us,
• a model that uses all the features.

We check the effectiveness of the above models, in the case of KNN for different metrics, and for the SVM model we test the performance for various kernels. We summarize whether the reduction of the dimensionality of our data allows us to obtain satisfactory results while decreasing computational complexity.

2. Raisin database

The database that we used to build the classifiers contains samples described by 7 morphological features, obtained after prior processing of the photos. The values are continuous, and each feature takes values from a different range. There are also high standard deviations, for example for the Area and ConvexArea features, indicating that the values of these features are highly dispersed around their means.

2.1. Standardization

Normalization or standardization is used to improve the efficiency and effectiveness of a model. In the case of the KNN model, which uses distance measures to classify data samples, if the values were not normalized or standardized, features with larger values could have a greater impact on the model's result, which could lead to low accuracy. It is therefore important and recommended to apply one of these data processing techniques before creating KNN and SVM models. Gaussian Naive Bayes, in contrast, generally does not require data standardization, because the algorithm does not depend on distances and therefore does not require feature scaling. We used standardization in the Gaussian Naive Bayes classifiers only when the PCA technique was applied and when two-dimensional data were used to plot decision boundaries. We used standardization, which transforms the data so that the mean is equal to 0 and the standard deviation is equal to 1. First, for each feature we calculated the mean value and the standard deviation, and then used these results to compute the new values with the formula below:

x_{new} = \frac{x - \mu}{\sigma}    (1)
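As a minimal sketch of this standardization step, using a small made-up toy matrix rather than the actual Raisin data:

```python
import numpy as np

# Toy feature matrix (hypothetical values): rows are samples, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# z-score standardization: x_new = (x - mu) / sigma, computed per feature.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

# After the transformation each feature has mean 0 and standard deviation 1.
print(np.allclose(X_std.mean(axis=0), 0.0))  # True
print(np.allclose(X_std.std(axis=0), 1.0))   # True
```

In practice, scikit-learn's StandardScaler performs the same transformation and additionally stores the mean and scale computed on the training split so that the test split can be scaled consistently.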
2.2. Model based on PCA

One of the popular dimensionality reduction techniques is PCA. The task of PCA is to return n features from which we can create a model with high accuracy. Interesting improvements of PCA models composed for graph-based classifiers were presented in [18]. In our models we used PCA, which returns new training and test data reduced from seven to two dimensions.

2.3. Model based on two features

Another way to prepare data for a model is to reduce the dimensionality based on correlation analysis. Correlation describes the relation between two variables: a correlation value close to 1 or -1 means a strong correlation, while a value close to 0 means a weak correlation. The Extent feature was removed from our training and test data because its correlation with the target feature was only 0.28. Additionally, the features ConvexArea, Perimeter, Area and MinorAxisLength were eliminated, because these attributes had strong relations with other features and did not contribute relevant information to the classification models. Finally, our classifiers were built on the other two features: MajorAxisLength and Eccentricity. Fig. 1 shows correlation plots between two features.

Figure 1: Correlation graphs of two features

2.4. Model based on all features

For each classifier we also built a model based on all seven features. Sometimes training a model on all attributes can be a disadvantage, because this approach leads to slower learning of the classifier. However, the advantage of including all features is that in some cases it can lead to very high efficiency of the machine learning algorithm, because no relevant information is lost. Fig. ?? illustrates our feature and correlation graphs.

3. Methods

3.1. KNN

3.1.1. Formulas

Euclidean distance:

D(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}    (2)

Manhattan distance:

D(x, y) = \sum_{i=1}^{m} |x_i - y_i|    (3)

Minkowski distance:

D(x, y) = \left( \sum_{i=1}^{m} |x_i - y_i|^r \right)^{1/r}    (4)

Canberra distance:

D(x, y) = \sum_{i=1}^{m} \frac{|x_i - y_i|}{|x_i| + |y_i|}    (5)

Chebyshev distance:

D(x, y) = \max_{i=1,...,m} |x_i - y_i|    (6)

Cosine distance:

D(x, y) = 1 - \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2} \sqrt{\sum_{i=1}^{m} y_i^2}}    (7)

3.1.2. Algorithm

The KNN classifier is a mathematical model that does not require training. New, unknown points are predicted based on the voting of the k nearest points. When classifying a new sample, the model calculates the distances between the sample and each point in the given n-dimensional space. Among all distances the model chooses the k smallest ones and voting takes place: the class that occurs most frequently among the selected points becomes the predicted class for the new sample. We compared the performance of the KNN classifier for different distance measures (metrics): Euclidean, Manhattan, Minkowski with r = 3, Canberra, Chebyshev and Cosine.

3.2. Gaussian Naive Bayes

3.2.1. Formulas

Bayes' Theorem in our model:

P(y | x_1, x_2, ..., x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i | y)}{P(x_1, x_2, ..., x_n)}    (8)

P(x_i | y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\left( -\frac{(x_i - \mu_y)^2}{2 \sigma_y^2} \right)    (9)

Sample variance:

\sigma^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2    (10)

3.2.2. Algorithm

Gaussian Naive Bayes is a probabilistic model that uses Bayes' theorem to determine the probability that a sample belongs to a specific class. The classifier assumes that the features are independent. We used this type of classifier because our data are continuous and approximately normally distributed. During the training process, the model calculates the mean and variance of each attribute for every class, as well as the a priori probabilities of each class. When predicting a test sample, two probabilities are returned, because we have binary classification, and we choose the label with the highest probability.

3.3. SVM

3.3.1. Formulas

Hinge loss:

\varepsilon_i = \max(0, 1 - y_i f(x_i))    (11)

Updating weights and bias:

w_t = w_t - \eta \nabla_w Cost(w_t)    (12)

b = b - \eta \nabla_b Cost(w_t)    (13)

Minimizing the cost function using Stochastic Gradient Descent (SGD):

\min Cost(w_i) = \lambda ||w||^2 + \max(0, 1 - y_i f(x_i))    (14)

\lambda = \frac{1}{N C}    (15)

3.3.2. Algorithm

The SVM classifier (Support Vector Machines) creates a hyperplane that maximizes the distance between the closest points of the two classes (the support vectors). When creating a hyperplane, two techniques are used: soft margin and hard margin. A soft margin allows the algorithm to make mistakes during the training process, so points may lie on the wrong side of the hyperplane or inside the margin. A hard margin does not tolerate any errors during training, so points cannot lie on the wrong side of the hyperplane or inside the margin. In our case, where the data are not completely linearly separable, we used the soft margin technique and various kernels to transform the data to a higher dimensionality. We also used the regularization parameter C. We created models in which one of the classes is equal to 1 and the other is equal to -1. Then, using Stochastic Gradient Descent, we updated the weights and the bias b after each data sample. Finally, we tested our models: if a predicted value was negative, it was assigned the label -1; if non-negative, the label 1. We compared the performance of classifiers using different kernels: linear, poly, rbf, laplacian and sigmoid.
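The dimensionality reduction of Sec. 2.2 can be sketched as follows; the random 100x7 matrix is only a stand-in for the standardized Raisin features, not the real data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))          # stand-in for 100 samples with 7 features

# PCA via the covariance matrix: center the data, then project it onto the
# eigenvectors belonging to the two largest eigenvalues.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
components = eigvecs[:, ::-1][:, :2]       # top-2 principal directions
X_2d = Xc @ components                     # data reduced from 7 to 2 dimensions

print(X_2d.shape)  # (100, 2)
```

scikit-learn's PCA(n_components=2), fitted on the training split only, computes the same projection (up to the sign of each component).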
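The KNN prediction rule of Sec. 3.1 can be sketched with a few of the listed metrics; the training points below are an invented toy example, not Raisin samples:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, metric="euclidean", r=3):
    # Compute the distance from x to every training point, Eqs. (2)-(6).
    diff = X_train - x
    if metric == "euclidean":
        d = np.sqrt((diff ** 2).sum(axis=1))
    elif metric == "manhattan":
        d = np.abs(diff).sum(axis=1)
    elif metric == "minkowski":
        d = (np.abs(diff) ** r).sum(axis=1) ** (1.0 / r)
    elif metric == "chebyshev":
        d = np.abs(diff).max(axis=1)
    else:
        raise ValueError(metric)
    # Vote among the k nearest neighbours: the most frequent class wins.
    nearest = y_train[np.argsort(d)[:k]]
    return Counter(nearest).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.1])))  # 1
```

scikit-learn's KNeighborsClassifier exposes the same choices through its metric and p parameters.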
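The training and prediction steps of Sec. 3.2 can be sketched as follows (a minimal illustration with invented toy data; it follows Eqs. (8)-(10), not the authors' actual implementation):

```python
import numpy as np

def gnb_fit(X, y):
    # For each class store the prior, per-feature mean and sample variance.
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),
                     Xc.mean(axis=0),
                     Xc.var(axis=0, ddof=1))   # Eq. (10), ddof=1
    return params

def gnb_predict(params, x):
    best, best_lp = None, -np.inf
    for c, (prior, mu, var) in params.items():
        # log P(y) + sum_i log N(x_i; mu_y, var_y), Eqs. (8)-(9);
        # the evidence P(x_1,...,x_n) is the same for both classes and is dropped.
        lp = np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
              [5.0, 6.0], [5.3, 5.8], [4.8, 6.2]])
y = np.array([0, 0, 0, 1, 1, 1])
model = gnb_fit(X, y)
print(gnb_predict(model, np.array([1.1, 2.0])))  # 0
print(gnb_predict(model, np.array([5.1, 6.0])))  # 1
```

Working in log-space avoids numerical underflow when the per-feature likelihoods are multiplied.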
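The SGD update of Eqs. (12)-(15) for a linear soft-margin SVM can be sketched as follows; the toy data, learning rate and epoch count are our own assumptions for illustration:

```python
import numpy as np

def svm_sgd_train(X, y, C=1.0, eta=0.01, epochs=200, seed=0):
    # y must be in {-1, +1}; lambda = 1 / (N * C), Eq. (15).
    n, m = X.shape
    lam = 1.0 / (n * C)
    rng = np.random.default_rng(seed)
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # Sample violates the margin: hinge-loss gradient, Eqs. (12)-(14).
                w -= eta * (2 * lam * w - y[i] * X[i])
                b -= eta * (-y[i])
            else:
                # Only the regularization term contributes.
                w -= eta * 2 * lam * w
    return w, b

def svm_predict(w, b, X):
    # Negative decision value -> label -1, non-negative -> label +1.
    return np.where(X @ w + b < 0, -1, 1)

X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1, 1, -1, -1])
w, b = svm_sgd_train(X, y)
print(svm_predict(w, b, X))
```

The non-linear kernels compared in this paper (poly, rbf, laplacian, sigmoid) replace the dot product X[i] @ w with a kernel evaluation; this sketch covers only the linear case.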
4. Experiments

4.1. KNN

We created the KNN models for the different metrics, with predictions based on the 3 nearest neighbors.

Figure 2: Classification reports for KNN with PCA. The results are shown in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine

Figure 3: Classification reports for KNN with two features.
The results are shown in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine

Figure 4: Decision boundaries for KNN with two features. The results are presented in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine

Figure 5: Classification reports for KNN with all features. The results are shown in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine

4.2. Gaussian Naive Bayes

Figure 6: Classification reports for Gaussian Naive Bayes, in order: using PCA and using all features

Figure 7: Classification report for Gaussian Naive Bayes with two features, and the decision boundaries of this model

4.3. SVM

We created SVM models for different kernels with specific parameters. These kernels are: linear, polynomial with degree 7, rbf with a gamma of 2, laplacian with a gamma of 2, and sigmoid with a gamma of 1.

Figure 8: Classification reports for SVM with PCA. The results are shown in order for the kernels: linear, poly, rbf, laplacian and sigmoid

Figure 9: Classification reports for SVM with two features. The results are shown in order for the kernels: linear, poly, rbf, laplacian and sigmoid

Figure 10: Classification reports for SVM with all features. The results are shown in order for the kernels: linear, poly, rbf, laplacian and sigmoid

Figure 11: Decision boundaries for SVM with two features

5. Conclusion

After analyzing our results for the three classifiers, we conclude that using the PCA technique to reduce the dimensionality from 7 to 2 features supported the performance of our models, which achieved high accuracies comparable to the results of models built on all features. After analyzing the correlation of our data, we were able to find two features, MajorAxisLength and Eccentricity, for which the models had accuracy similar to the PCA-based models. The accuracies of the KNN classifiers with and without PCA are very similar, ranging from 82% to 88% depending on the metric. In the case of the Gaussian Naive Bayes classifiers, the accuracy obtained using PCA and using all 7 features had the same value of 85%, which confirms that the reduction in dimensions did not cause a loss of significant information. The last classifier analyzed was SVM. Across the different kernels, the sigmoid kernel turned out to be the best, yielding the highest accuracy of 88% in the models both with and without PCA.

References

[1] K. Thirunavukkarasu, A. S. Singh, P. Rai, S. Gupta, Classification of iris dataset using classification based knn algorithm in supervised learning, 2018 4th International Conference on Computing Communication and Automation (ICCCA) (2018) 1–4.
[2] G. De Magistris, R. Caprari, G. Castro, S. Russo, L. Iocchi, D. Nardi, C. Napoli, Vision-based holistic scene understanding for context-aware human-robot interaction, 13196 LNAI (2022) 310–325. doi:10.1007/978-3-031-08421-8_21.
[3] V. Amaresh, R. R. Singh, R. Kamal, A. Kulkarni, Linear regression models based housing price forecasting, 2022 International Conference on Industry 4.0 Technology (I4Tech) (2022) 1–5.
[4] W. Feng, J. Zhang, Y. Chen, Z. Qin, Y. Zhang, M. Ahmad, M. Woźniak, Exploiting robust quadratic polynomial hyperchaotic map and pixel fusion strategy for efficient image encryption, Expert Systems with Applications 246 (2024) 123190.
[5] M. Wozniak, C. Napoli, E. Tramontana, G. Capizzi, G. Lo Sciuto, R. K. Nowicki, J. T. Starczewski, A multiscale image compressor with rbfnn and discrete wavelet decomposition, volume 2015-September, 2015. doi:10.1109/IJCNN.2015.7280461.
[6] G. Capizzi, C. Napoli, S. Russo, M. Woźniak, Lessening stress and anxiety-related behaviors by means of ai-driven drones for aromatherapy, volume 2594, 2020, pp. 7–12.
[7] N. Brandizzi, S. Russo, R. Brociek, A. Wajda, First studies to apply the theory of mind theory to green and smart mobility by using gaussian area clustering, volume 3118, 2021, pp. 71–76.
[8] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models, volume 3360, 2022, pp. 55–63.
[9] D. P. Hapsari, I. Utoyo, S. W. Purnami, Fractional gradient descent optimizer for linear classifier support vector machine, 2020 Third International Conference on Vocational Education and Electrical Engineering (ICVEE) (2020) 1–5.
[10] G. Capizzi, G. L. Sciuto, C. Napoli, R. Shikler, M. Wozniak, Optimizing the organic solar cell manufacturing process by means of afm measurements and neural networks, Energies 11 (2018). doi:10.3390/en11051221.
[11] M. Woźniak, M. Wieczorek, J. Siłka, Bilstm deep neural network model for imbalanced medical data of iot systems, Future Generation Computer Systems 141 (2023) 489–499.
[12] A. Sikora, A. Zielonka, M. F. Ijaz, M. Woźniak, Digital twin heuristic positioning of insulation in multimodal electric systems, IEEE Transactions on Consumer Electronics (2024).
[13] G. Capizzi, C. Napoli, L. Paternò, An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys, 7267 LNAI (2012) 21–29. doi:10.1007/978-3-642-29347-4_3.
[14] Q. Ke, X. Jing, M. Woźniak, S. Xu, Y. Liang, J. Zheng, Apgvae: Adaptive disentangled representation learning with the graph-based structure information, Information Sciences 657 (2024) 119903.
[15] N. A. Al-Sammarraie, Y. M. H. Al-Mayali, Y. A. B. El-Ebiary, Classification and diagnosis using back propagation artificial neural networks (ann), 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE) (2018) 1–5.
[16] D. Gangadia, Activation functions: experimentation and comparison, 2021 6th International Conference for Convergence in Technology (I2CT) (2021) 1–6.
[17] P. H. Progga, M. J. Rahman, S. Biswas, M. S. Ahmed, D. M. Farid, K-nearest neighbour classifier for big data mining based on informative instances, 2023 IEEE 8th International Conference for Convergence in Technology (I2CT) (2023) 1–7.
[18] W. Dong, M. Woźniak, J. Wu, W. Li, Z. Bai, Denoising aggregation of graph neural networks by using principal component analysis, IEEE Transactions on Industrial Informatics 19 (2022) 2385–2394.