The Effectiveness of PCA in KNN, Gaussian Naive Bayes Classifier and SVM for Raisin Dataset

Agnieszka Polowczyk¹, Alicja Polowczyk¹

¹ Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice, Poland

Abstract
Supervised learning is one of the main types of machine learning, in which a model is trained on data consisting of features (input data) and labels, that is, target values. Using the training data, the parameters of the model are adjusted until the loss function reaches a low value or until high accuracy is obtained on the validation data. Before building a model, the data must be preprocessed; PCA is often used to reduce the number of dimensions of the data, and models whose data have been reduced with PCA often retain high accuracy. In this article we examine how the well-known classifiers K-Nearest Neighbors, Gaussian Naive Bayes and Support Vector Machines perform when combined with PCA. We also evaluate classifiers whose data were reduced to fewer dimensions by analyzing correlation tables, as well as models whose data contain the original number of features. We assess their effectiveness on the Raisin database and show the decision boundaries of the models constructed in our analysis.

Keywords
Machine learning, PCA, KNN, Gaussian Naive Bayes, SVM, classifiers

SYSYEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
ap307985@polsl.pl (A. Polowczyk); ap307986@polsl.pl (A. Polowczyk)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Supervised learning is used in many areas, such as classification [1, 2], regression [3], pattern recognition, natural language processing and image encryption [4, 5, 6]. Examples of problems that can be solved with supervised learning are classifying whether an email is spam or not, weather forecasting, and text classification, for instance deciding whether a review is positive or negative [7, 8]. A popular algorithm used during model training, for example for regression or the SVM classifier, is gradient descent, which minimizes the loss function by adjusting the model parameters in the direction of the decreasing gradient of the loss function [9, 10]. The goal of supervised learning is to achieve accuracy high enough to make correct predictions on unseen data. There are many interesting improvements of such models for application systems. In [11] it was shown how to use machine learning for imbalanced data inputs, while [12, 13] presented the positioning of technical systems for power electric models. There are also many applications for complex input data structures, e.g. [14] addressed compositions of graph-based input relations.

In classification problems we also distinguish models based on deep neural networks [15], whose architecture consists of weights, activation functions [16], a loss function and an optimizer. In the case of the KNN and Gaussian Naive Bayes classifiers there is no learning with weights. Using a KNN model on a large dataset can lead to high consumption of computing resources; in [17] a strategy was proposed that improves the efficiency of the KNN classifier on Big Data.

In this paper we compare three classifiers, KNN, Gaussian Naive Bayes and SVM, each built on three variants of the data:
• a model that uses PCA to reduce the dimensionality of the data,
• a model that uses two features selected by us,
• a model that uses all the features.

We check the effectiveness of the above models, in the case of KNN for different metrics, and for the SVM model we test the performance for various kernels. We summarize whether the reduction of the dimensionality of our data allows us to obtain satisfactory results while decreasing computational complexity.

2. Raisin database

The database that we used to build the classifiers contains samples described by 7 morphological features, obtained after prior processing of the photos. The values are continuous, and each feature takes values from a different range. There are also high standard deviations, for example for the Area and ConvexArea features, indicating that the values of these features are highly dispersed around their means.

2.1. Standardization

Normalization or standardization is used to improve the efficiency and effectiveness of a model. In the case of the KNN model, which uses distance measures to classify data samples, if the values were not normalized or standardized, features with larger values could have a greater impact on the model's result, which could lead to low accuracy. It is therefore important and recommended to apply one of these data processing techniques before creating KNN and SVM models. Gaussian Naive Bayes, in contrast, generally does not require data standardization, because the algorithm does not depend on distances and therefore does not require feature scaling. We used standardization in the Gaussian Naive Bayes classifiers only when the PCA technique was applied and when two-dimensional data were used to plot decision boundaries. We used standardization, which transforms the data so that the mean is equal to 0 and the standard deviation is equal to 1. First, for each feature we calculated the mean value and the standard deviation, and then used these results to compute the new values with the formula below:

x_{new} = \frac{x - \mu}{\sigma}    (1)
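As a minimal sketch of this standardization step, using a small made-up toy matrix rather than the actual Raisin data:

```python
import numpy as np

# Toy feature matrix (hypothetical values): rows are samples, columns are features.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# z-score standardization: x_new = (x - mu) / sigma, computed per feature.
mu = X.mean(axis=0)
sigma = X.std(axis=0)
X_std = (X - mu) / sigma

# After the transformation each feature has mean 0 and standard deviation 1.
print(np.allclose(X_std.mean(axis=0), 0.0))  # True
print(np.allclose(X_std.std(axis=0), 1.0))   # True
```

In practice, scikit-learn's StandardScaler performs the same transformation and additionally stores the mean and scale computed on the training split so that the test split can be scaled consistently.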
2.2. Model based on PCA

One of the popular dimensionality reduction techniques is PCA. The task of PCA is to return n features from which we can create a model with high accuracy. Interesting improvements of PCA models composed for graph-based classifiers were presented in [18]. In our models we used PCA, which returns new training and test data reduced from seven to two dimensions.

2.3. Model based on two features

Another way to prepare data for a model is to reduce the dimensionality based on correlation analysis. Correlation describes the relation between two variables: a correlation value close to 1 or -1 means a strong correlation, while a value close to 0 means a weak correlation. The Extent feature was removed from our training and test data because its correlation with the target feature was only 0.28. Additionally, the features ConvexArea, Perimeter, Area and MinorAxisLength were eliminated, because these attributes had strong relations with other features and did not contribute relevant information to the classification models. Finally, our classifiers were built on the other two features: MajorAxisLength and Eccentricity. Fig. 1 shows correlation plots between two features.

Figure 1: Correlation graphs of two features

2.4. Model based on all features

For each classifier we also built a model based on all seven features. Sometimes training a model on all attributes can be a disadvantage, because this approach leads to slower learning of the classifier. However, the advantage of including all features is that in some cases it can lead to very high efficiency of the machine learning algorithm, because no relevant information is lost. Fig. ?? illustrates our feature and correlation graphs.

3. Methods

3.1. KNN

3.1.1. Formulas

Euclidean distance:

D(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}    (2)

Manhattan distance:

D(x, y) = \sum_{i=1}^{m} |x_i - y_i|    (3)

Minkowski distance:

D(x, y) = \left( \sum_{i=1}^{m} |x_i - y_i|^r \right)^{1/r}    (4)

Canberra distance:

D(x, y) = \sum_{i=1}^{m} \frac{|x_i - y_i|}{|x_i| + |y_i|}    (5)

Chebyshev distance:

D(x, y) = \max_{i=1,...,m} |x_i - y_i|    (6)

Cosine distance:

D(x, y) = 1 - \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2} \sqrt{\sum_{i=1}^{m} y_i^2}}    (7)

3.1.2. Algorithm

The KNN classifier is a mathematical model that does not require training. New, unknown points are predicted based on the voting of the k nearest points. When classifying a new sample, the model calculates the distances between the sample and each point in the given n-dimensional space. Among all distances the model chooses the k smallest ones and voting takes place: the class that occurs most frequently among the selected points becomes the predicted class for the new sample. We compared the performance of the KNN classifier for different distance measures (metrics): Euclidean, Manhattan, Minkowski with r = 3, Canberra, Chebyshev and Cosine.

3.2. Gaussian Naive Bayes

3.2.1. Formulas

Bayes' Theorem in our model:

P(y | x_1, x_2, ..., x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i | y)}{P(x_1, x_2, ..., x_n)}    (8)

P(x_i | y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp\left( -\frac{(x_i - \mu_y)^2}{2 \sigma_y^2} \right)    (9)

Sample variance:

\sigma^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2    (10)

3.2.2. Algorithm

Gaussian Naive Bayes is a probabilistic model that uses Bayes' theorem to determine the probability that a sample belongs to a specific class. The classifier assumes that the features are independent. We used this type of classifier because our data are continuous and approximately normally distributed. During the training process, the model calculates the mean and variance of each attribute for every class, as well as the a priori probabilities of each class. When predicting a test sample, two probabilities are returned, because we have binary classification, and we choose the label with the highest probability.

3.3. SVM

3.3.1. Formulas

Hinge loss:

\varepsilon_i = \max(0, 1 - y_i f(x_i))    (11)

Updating weights and bias:

w_t = w_t - \eta \nabla_w Cost(w_t)    (12)

b = b - \eta \nabla_b Cost(w_t)    (13)

Minimizing the cost function using Stochastic Gradient Descent (SGD):

\min Cost(w_i) = \lambda ||w||^2 + \max(0, 1 - y_i f(x_i))    (14)

\lambda = \frac{1}{N C}    (15)

3.3.2. Algorithm

The SVM classifier (Support Vector Machines) creates a hyperplane that maximizes the distance between the closest points of the two classes (the support vectors). When creating a hyperplane, two techniques are used: soft margin and hard margin. A soft margin allows the algorithm to make mistakes during the training process, so points may lie on the wrong side of the hyperplane or inside the margin. A hard margin does not tolerate any errors during training, so points cannot lie on the wrong side of the hyperplane or inside the margin. In our case, where the data are not completely linearly separable, we used the soft margin technique and various kernels to transform the data to a higher dimensionality. We also used the regularization parameter C. We created models in which one of the classes is equal to 1 and the other is equal to -1. Then, using Stochastic Gradient Descent, we updated the weights and the bias b after each data sample. Finally, we tested our models: if a predicted value was negative, it was assigned the label -1; if non-negative, the label 1. We compared the performance of classifiers using different kernels: linear, poly, rbf, laplacian and sigmoid.
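The dimensionality reduction of Sec. 2.2 can be sketched as follows; the random 100x7 matrix is only a stand-in for the standardized Raisin features, not the real data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 7))          # stand-in for 100 samples with 7 features

# PCA via the covariance matrix: center the data, then project it onto the
# eigenvectors belonging to the two largest eigenvalues.
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
components = eigvecs[:, ::-1][:, :2]       # top-2 principal directions
X_2d = Xc @ components                     # data reduced from 7 to 2 dimensions

print(X_2d.shape)  # (100, 2)
```

scikit-learn's PCA(n_components=2), fitted on the training split only, computes the same projection (up to the sign of each component).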
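The KNN prediction rule of Sec. 3.1 can be sketched with a few of the listed metrics; the training points below are an invented toy example, not Raisin samples:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3, metric="euclidean", r=3):
    # Compute the distance from x to every training point, Eqs. (2)-(6).
    diff = X_train - x
    if metric == "euclidean":
        d = np.sqrt((diff ** 2).sum(axis=1))
    elif metric == "manhattan":
        d = np.abs(diff).sum(axis=1)
    elif metric == "minkowski":
        d = (np.abs(diff) ** r).sum(axis=1) ** (1.0 / r)
    elif metric == "chebyshev":
        d = np.abs(diff).max(axis=1)
    else:
        raise ValueError(metric)
    # Vote among the k nearest neighbours: the most frequent class wins.
    nearest = y_train[np.argsort(d)[:k]]
    return Counter(nearest).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1])))  # 0
print(knn_predict(X_train, y_train, np.array([5.1, 5.1])))  # 1
```

scikit-learn's KNeighborsClassifier exposes the same choices through its metric and p parameters.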
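The training and prediction steps of Sec. 3.2 can be sketched as follows (a minimal illustration with invented toy data; it follows Eqs. (8)-(10), not the authors' actual implementation):

```python
import numpy as np

def gnb_fit(X, y):
    # For each class store the prior, per-feature mean and sample variance.
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),
                     Xc.mean(axis=0),
                     Xc.var(axis=0, ddof=1))   # Eq. (10), ddof=1
    return params

def gnb_predict(params, x):
    best, best_lp = None, -np.inf
    for c, (prior, mu, var) in params.items():
        # log P(y) + sum_i log N(x_i; mu_y, var_y), Eqs. (8)-(9);
        # the evidence P(x_1,...,x_n) is the same for both classes and is dropped.
        lp = np.log(prior) - 0.5 * np.sum(
            np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
              [5.0, 6.0], [5.3, 5.8], [4.8, 6.2]])
y = np.array([0, 0, 0, 1, 1, 1])
model = gnb_fit(X, y)
print(gnb_predict(model, np.array([1.1, 2.0])))  # 0
print(gnb_predict(model, np.array([5.1, 6.0])))  # 1
```

Working in log-space avoids numerical underflow when the per-feature likelihoods are multiplied.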
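The SGD update of Eqs. (12)-(15) for a linear soft-margin SVM can be sketched as follows; the toy data, learning rate and epoch count are our own assumptions for illustration:

```python
import numpy as np

def svm_sgd_train(X, y, C=1.0, eta=0.01, epochs=200, seed=0):
    # y must be in {-1, +1}; lambda = 1 / (N * C), Eq. (15).
    n, m = X.shape
    lam = 1.0 / (n * C)
    rng = np.random.default_rng(seed)
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # Sample violates the margin: hinge-loss gradient, Eqs. (12)-(14).
                w -= eta * (2 * lam * w - y[i] * X[i])
                b -= eta * (-y[i])
            else:
                # Only the regularization term contributes.
                w -= eta * 2 * lam * w
    return w, b

def svm_predict(w, b, X):
    # Negative decision value -> label -1, non-negative -> label +1.
    return np.where(X @ w + b < 0, -1, 1)

X = np.array([[2.0, 2.0], [2.5, 1.5], [-2.0, -2.0], [-1.5, -2.5]])
y = np.array([1, 1, -1, -1])
w, b = svm_sgd_train(X, y)
print(svm_predict(w, b, X))
```

The non-linear kernels compared in this paper (poly, rbf, laplacian, sigmoid) replace the dot product X[i] @ w with a kernel evaluation; this sketch covers only the linear case.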
4. Experiments

4.1. KNN

We created the KNN models for the different metrics, with predictions based on the 3 nearest neighbors.

Figure 2: Classification reports for KNN with PCA. The results are shown in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine

Figure 3: Classification reports for KNN with two features.
The results are shown in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine

Figure 4: Decision boundaries for KNN with two features. The results are presented in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine

Figure 5: Classification reports for KNN with all features. The results are shown in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine

4.2. Gaussian Naive Bayes

Figure 6: Classification reports for Gaussian Naive Bayes, in order: using PCA and using all features

Figure 7: Classification report for Gaussian Naive Bayes with two features, and the decision boundaries of this model

4.3. SVM

We created SVM models for different kernels with specific parameters. These kernels are: linear, polynomial with degree 7, rbf with a gamma of 2, laplacian with a gamma of 2, and sigmoid with a gamma of 1.

Figure 8: Classification reports for SVM with PCA. The results are shown in order for the kernels: linear, poly, rbf, laplacian and sigmoid

Figure 9: Classification reports for SVM with two features. The results are shown in order for the kernels: linear, poly, rbf, laplacian and sigmoid

Figure 10: Classification reports for SVM with all features. The results are shown in order for the kernels: linear, poly, rbf, laplacian and sigmoid

Figure 11: Decision boundaries for SVM with two features

5. Conclusion

After analyzing our results for the three classifiers, we conclude that using the PCA technique to reduce the dimensionality from 7 to 2 features supported the performance of our models, which achieved high accuracies comparable to the results of models built on all features. After analyzing the correlation of our data, we were able to find two features, MajorAxisLength and Eccentricity, for which the models had accuracy similar to the PCA-based models. The accuracies of the KNN classifiers with and without PCA are very similar, ranging from 82% to 88% depending on the metric. In the case of the Gaussian Naive Bayes classifiers, the accuracy obtained using PCA and using all 7 features had the same value of 85%, which confirms that the reduction in dimensions did not cause a loss of significant information. The last classifier analyzed was SVM. Across the different kernels, the sigmoid kernel turned out to be the best, yielding the highest accuracy of 88% in the models both with and without PCA.

References

[1] K. Thirunavukkarasu, A. S. Singh, P. Rai, S. Gupta, Classification of iris dataset using classification based knn algorithm in supervised learning, 2018 4th International Conference on Computing Communication and Automation (ICCCA) (2018) 1–4.
[2] G. De Magistris, R. Caprari, G. Castro, S. Russo, L. Iocchi, D. Nardi, C. Napoli, Vision-based holistic scene understanding for context-aware human-robot interaction, 13196 LNAI (2022) 310–325. doi:10.1007/978-3-031-08421-8_21.
[3] V. Amaresh, R. R. Singh, R. Kamal, A. Kulkarni, Linear regression models based housing price forecasting, 2022 International Conference on Industry 4.0 Technology (I4Tech) (2022) 1–5.
[4] W. Feng, J. Zhang, Y. Chen, Z. Qin, Y. Zhang, M. Ahmad, M. Woźniak, Exploiting robust quadratic polynomial hyperchaotic map and pixel fusion strategy for efficient image encryption, Expert Systems with Applications 246 (2024) 123190.
[5] M. Wozniak, C. Napoli, E. Tramontana, G. Capizzi, G. Lo Sciuto, R. K. Nowicki, J. T. Starczewski, A multiscale image compressor with rbfnn and discrete wavelet decomposition, volume 2015-September, 2015. doi:10.1109/IJCNN.2015.7280461.
[6] G. Capizzi, C. Napoli, S. Russo, M. Woźniak, Lessening stress and anxiety-related behaviors by means of ai-driven drones for aromatherapy, volume 2594, 2020, pp. 7–12.
[7] N. Brandizzi, S. Russo, R. Brociek, A. Wajda, First studies to apply the theory of mind theory to green and smart mobility by using gaussian area clustering, volume 3118, 2021, pp. 71–76.
[8] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models, volume 3360, 2022, pp. 55–63.
[9] D. P. Hapsari, I. Utoyo, S. W. Purnami, Fractional gradient descent optimizer for linear classifier support vector machine, 2020 Third International Conference on Vocational Education and Electrical Engineering (ICVEE) (2020) 1–5.
[10] G. Capizzi, G. L. Sciuto, C. Napoli, R. Shikler, M. Wozniak, Optimizing the organic solar cell manufacturing process by means of afm measurements and neural networks, Energies 11 (2018). doi:10.3390/en11051221.
[11] M. Woźniak, M. Wieczorek, J. Siłka, Bilstm deep neural network model for imbalanced medical data of iot systems, Future Generation Computer Systems 141 (2023) 489–499.
[12] A. Sikora, A. Zielonka, M. F. Ijaz, M. Woźniak, Digital twin heuristic positioning of insulation in multimodal electric systems, IEEE Transactions on Consumer Electronics (2024).
[13] G. Capizzi, C. Napoli, L. Paternò, An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys, 7267 LNAI (2012) 21–29. doi:10.1007/978-3-642-29347-4_3.
[14] Q. Ke, X. Jing, M. Woźniak, S. Xu, Y. Liang, J. Zheng, Apgvae: Adaptive disentangled representation learning with the graph-based structure information, Information Sciences 657 (2024) 119903.
[15] N. A. Al-Sammarraie, Y. M. H. Al-Mayali, Y. A. B. El-Ebiary, Classification and diagnosis using back propagation artificial neural networks (ann), 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE) (2018) 1–5.
[16] D. Gangadia, Activation functions: experimentation and comparison, 2021 6th International Conference for Convergence in Technology (I2CT) (2021) 1–6.
[17] P. H. Progga, M. J. Rahman, S. Biswas, M. S. Ahmed, D. M. Farid, K-nearest neighbour classifier for big data mining based on informative instances, 2023 IEEE 8th International Conference for Convergence in Technology (I2CT) (2023) 1–7.
[18] W. Dong, M. Woźniak, J. Wu, W. Li, Z. Bai, Denoising aggregation of graph neural networks by using principal component analysis, IEEE Transactions on Industrial Informatics 19 (2022) 2385–2394.