=Paper=
{{Paper
|id=Vol-3695/p02
|storemode=property
|title=The Effectiveness of PCA in KNN, Gaussian Naive Bayes Classifier and SVM for Raisin Dataset
|pdfUrl=https://ceur-ws.org/Vol-3695/p02.pdf
|volume=Vol-3695
|authors=Agnieszka Polowczyk,Alicja Polowczyk
|dblpUrl=https://dblp.org/rec/conf/system/PolowczykP23
}}
==The Effectiveness of PCA in KNN, Gaussian Naive Bayes Classifier and SVM for Raisin Dataset==
Agnieszka Polowczyk¹, Alicja Polowczyk¹

¹ Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice, Poland

SYSTEM 2023: 9th Scholar's Yearly Symposium of Technology, Engineering and Mathematics, Rome, December 3-6, 2023
ap307985@polsl.pl (A. Polowczyk); ap307986@polsl.pl (A. Polowczyk)
Abstract
Supervised learning is one of the main types of machine learning, in which a model is trained on data consisting of features (input data) and labels, i.e. target values. Using the training data, the parameters of the model are adjusted until the loss function reaches a low value or until we obtain high accuracy on the validation data. Before we start building a model, we need to preprocess the data. PCA is often used to reduce the number of dimensions of the data, and models whose data have been reduced with PCA often retain high accuracy. In this article we examine how well-known classifiers, namely K-Nearest Neighbors, Gaussian Naive Bayes and Support Vector Machines, perform when PCA is used. We also check the performance of classifiers whose data have been reduced to fewer dimensions by analyzing correlation tables, and we look at models whose data contain the original number of features. We evaluate their effectiveness on the Raisin database and show the decision boundaries of the models constructed after our analysis.
Keywords
Machine learning, PCA, KNN, Gaussian Naive Bayes, SVM, classifiers
1. Introduction

Supervised learning is used in many areas, such as classification [1, 2], regression [3], pattern recognition, natural language processing and image encryption [4, 5, 6]. Examples of problems that can be solved with supervised learning are classifying whether an email is spam or not, weather forecasting, and classifying text, for instance whether a review is positive or negative [7, 8].

A popular algorithm used during model training, for example in the case of regression or the SVM classifier, is gradient descent, which minimizes the loss function by adjusting the parameters of the model in the direction of the decreasing gradient of the loss function [9, 10]. The goal of supervised learning is to achieve high accuracy, i.e. to make correct predictions on unknown data. There are many interesting improvements to such models for application systems. In [11] it was shown how to use machine learning for imbalanced data inputs, while [12, 13] presented the positioning of technical systems for power electric models. We can also find many applications for complex input data structures, e.g. [14] did so for graph-based input relation compositions.

In classification problems we also distinguish models based on deep neural networks [15]. The architecture of a neural network consists of weights, activation functions [16], a loss function and an optimizer. In the case of the KNN and Gaussian Naive Bayes classifiers there is no learning with weights. Using a KNN model on a large dataset can lead to high consumption of computing resources; in [17] a strategy was proposed that improves the efficiency of the KNN classifier on Big Data.

In this paper we compare three classifiers, KNN, Gaussian Naive Bayes and SVM, each built on three variants of the data:

• a model that uses PCA to reduce the dimensionality of the data
• a model that uses two features selected by us
• a model that uses all the features

We check the effectiveness of the above models, in the case of KNN for different metrics, and for the SVM model we test the performance for various kernels. We summarize whether reducing the dimensionality of our data allows us to obtain satisfactory results while decreasing computational complexity.
2. Raisin database

The database that we used to build the various classifiers contains samples described by 7 morphological features. These features were obtained after processing the photos. The values are continuous, and we can see that each feature takes values from a different range. There are also high standard deviations, for example for the Area and ConvexArea features, indicating that the values of these features are highly dispersed around their mean.
2.1. Standardization

Normalization or standardization is used to improve the efficiency and effectiveness of the model. In the case of the KNN model, which uses distance measures to classify data samples, if the values were not normalized or standardized, features with higher values could have a greater impact on the model's result, which could lead to low accuracy. Therefore, an important and recommended step is to apply one of these data processing techniques before creating KNN and SVM models. Gaussian Naive Bayes generally does not use data standardization, because this algorithm does not depend on distances and so does not require scaling of features. We used standardization exceptionally in the Gaussian Naive Bayes classifiers in which the PCA technique was applied, and in the classifiers that used two-dimensional data to plot decision boundaries. Standardization transforms our data so that the mean equals 0 and the standard deviation equals 1. First, for every feature we calculated the mean value and the standard deviation, and then we used these results to compute the new values with the formula below:

$$x_{new} = \frac{x - \mu}{\sigma} \quad (1)$$
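The paper does not list an implementation of Eq. (1), but a minimal NumPy sketch could look as follows. The mean and standard deviation are computed on the training data only and reused on the test data to avoid information leakage; the feature values in the example are hypothetical, not taken from the Raisin data.

```python
import numpy as np

def standardize(train, test):
    """Z-score standardization following Eq. (1): x_new = (x - mu) / sigma."""
    mu = train.mean(axis=0)
    sigma = train.std(axis=0)
    return (train - mu) / sigma, (test - mu) / sigma

# Two toy features on very different scales (hypothetical values):
X_train = np.array([[87524.0, 0.72], [75166.0, 0.68], [90856.0, 0.81]])
X_test = np.array([[79408.0, 0.70]])
X_train_std, X_test_std = standardize(X_train, X_test)
print(X_train_std.mean(axis=0))  # approximately 0 for each feature
print(X_train_std.std(axis=0))   # approximately 1 for each feature
```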
2.2. Model based on PCA

One of the popular dimensionality reduction techniques is PCA. The task of PCA is to return n features from which we can create a model with high accuracy. Interesting improvements to PCA models composed for graph-based classifiers were presented in [18]. In our models we used PCA, which returns new training and test data reduced from seven to two dimensions.
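A library-based sketch of this step, assuming scikit-learn (the paper does not specify its tooling). PCA is fitted on standardized training data and reduces from seven to two dimensions; the random matrices stand in for the Raisin feature matrices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical placeholders for the 7-feature Raisin matrices.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(720, 7)), rng.normal(size=(180, 7))

# Standardize first so no single feature dominates the principal components.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

X_train_2d = pca.transform(scaler.transform(X_train))
X_test_2d = pca.transform(scaler.transform(X_test))
print(X_train_2d.shape)               # (720, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by the 2 components
```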
2.3. Model based on two features

Another way to prepare data for the model is to reduce dimensionality based on correlation analysis. Correlation quantifies the relation between two variables: a value close to 1 or -1 means a strong correlation, while a value close to 0 means a weak correlation. The Extent feature was removed from our training and testing data because its correlation with our target feature was only 0.28. Additionally, the following features were eliminated: ConvexArea, Perimeter, Area and MinorAxisLength, because these attributes had strong relations with other features and did not contribute relevant information to the classification models. Finally, our classifiers were built on the two remaining features: MajorAxisLength and Eccentricity. Fig. 1 shows correlation plots between two features.

Figure 1: Correlation graphs of two features
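The selection described above can be sketched as follows. The column name Class and the thresholds 0.3 and 0.9 are assumptions for illustration; the paper made its choices by inspecting the correlation tables directly.

```python
import pandas as pd

def select_by_correlation(df: pd.DataFrame, target: str = "Class",
                          weak: float = 0.3, strong: float = 0.9) -> list:
    """Drop features weakly correlated with the (numerically encoded) target,
    then drop features strongly correlated with an already-kept feature."""
    corr = df.corr(numeric_only=True).abs()
    features = [c for c in corr.columns if c != target]
    # Step 1: discard features with only a weak relation to the target
    # (in the paper, Extent was removed at this stage with a value of 0.28).
    candidates = [c for c in features if corr.loc[c, target] >= weak]
    # Step 2: keep one representative per strongly correlated group
    # (ConvexArea, Perimeter, Area, MinorAxisLength fell at this stage).
    kept = []
    for c in candidates:
        if all(corr.loc[c, k] < strong for k in kept):
            kept.append(c)
    return kept
```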
2.4. Model based on all features

For each classifier, we also built a model based on all seven features. Sometimes training a model on all attributes can be a disadvantage, because this approach leads to slower learning of the classifier. However, the advantage of including all features is that in some cases it can lead to very high efficiency of the machine learning algorithm, because we do not lose any relevant information. Fig. ?? illustrates our feature and correlation graphs.

3. Methods

3.1. KNN

3.1.1. Formulas

Euclidean distance:
$$D(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2} \quad (2)$$

Manhattan distance:
$$D(x, y) = \sum_{i=1}^{m} |x_i - y_i| \quad (3)$$

Minkowski distance:
$$D(x, y) = \left( \sum_{i=1}^{m} |x_i - y_i|^r \right)^{\frac{1}{r}} \quad (4)$$

Canberra distance:
$$D(x, y) = \sum_{i=1}^{m} \frac{|x_i - y_i|}{|x_i| + |y_i|} \quad (5)$$

Chebyshev distance:
$$D(x, y) = \max_{i=1,\dots,m} |x_i - y_i| \quad (6)$$
Cosine distance:
$$D(x, y) = 1 - \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2} \sqrt{\sum_{i=1}^{m} y_i^2}} \quad (7)$$

3.1.2. Algorithm

The KNN classifier is a mathematical model that does not require training. New, unknown points are predicted based on voting among the k nearest points. When classifying a new sample, the model calculates the distances between the sample and each point in the specified n-dimensional space. Among all the distances the model chooses the k smallest, and voting takes place: the class that occurs most frequently among the selected points becomes the predicted class for the new sample. We compared the performance of the KNN classifier for different distance measures (metrics): Euclidean, Manhattan, Minkowski for r = 3, Canberra, Chebyshev and Cosine.
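A from-scratch sketch of this voting procedure with the metrics of Eqs. (2)-(7); the paper does not publish its code, so the function below is illustrative rather than the authors' implementation.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3, metric="euclidean", r=3):
    """Predict one sample by majority vote among the k nearest training points."""
    diff = X_train - x_new
    if metric == "euclidean":
        d = np.sqrt((diff ** 2).sum(axis=1))                              # Eq. (2)
    elif metric == "manhattan":
        d = np.abs(diff).sum(axis=1)                                      # Eq. (3)
    elif metric == "minkowski":
        d = (np.abs(diff) ** r).sum(axis=1) ** (1.0 / r)                  # Eq. (4)
    elif metric == "canberra":
        d = (np.abs(diff) / (np.abs(X_train) + np.abs(x_new))).sum(axis=1)  # Eq. (5)
    elif metric == "chebyshev":
        d = np.abs(diff).max(axis=1)                                      # Eq. (6)
    else:  # cosine, Eq. (7)
        num = (X_train * x_new).sum(axis=1)
        den = np.linalg.norm(X_train, axis=1) * np.linalg.norm(x_new)
        d = 1.0 - num / den
    nearest = np.argsort(d)[:k]                 # indices of the k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]  # majority vote

# Toy usage with hypothetical two-dimensional points:
X_tr = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [0.5, 0.5]])
y_tr = np.array([0, 0, 1, 1])
print(knn_predict(X_tr, y_tr, np.array([2.5, 2.5])))
```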
3.2. Gaussian Naive Bayes

3.2.1. Formulas

Bayes' Theorem in our model:
$$P(y \mid x_1, x_2, \dots, x_n) = \frac{P(y) \cdot \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, x_2, \dots, x_n)} \quad (8)$$

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \cdot \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \quad (9)$$

Sample variance:
$$\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad (10)$$
3.2.2. Algorithm

Gaussian Naive Bayes is a probabilistic model that uses Bayes' Theorem to determine the probability of a sample belonging to a specific class. The classifier assumes that the features are independent. We used this type of classifier because our data is continuous and approximately normally distributed. During the training process, our model calculated the mean and variance of each attribute for every class, together with the "a priori" probabilities of each class. When predicting a test sample, two probabilities are returned, because we have a binary classification problem, and we choose the label with the highest probability.
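The training and prediction steps just described can be sketched as below. Working in log-probabilities is an implementation detail assumed here for numerical stability; it is not stated in the paper.

```python
import numpy as np

class GaussianNaiveBayes:
    """Minimal Gaussian Naive Bayes following Eqs. (8)-(10)."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        # Per-class mean, sample variance (Eq. (10), ddof=1) and prior.
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0, ddof=1) for c in self.classes])
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # Log of Eq. (9) summed over features; the evidence P(x_1, ..., x_n)
            # in Eq. (8) is constant across classes and can be ignored.
            log_lik = -0.5 * (np.log(2 * np.pi * self.var)
                              + (x - self.mu) ** 2 / self.var).sum(axis=1)
            preds.append(self.classes[np.argmax(np.log(self.prior) + log_lik)])
        return np.array(preds)
```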
3.3. SVM

3.3.1. Formulas

Hinge Loss:
$$\varepsilon_i = \max(0, 1 - y_i f(x_i)) \quad (11)$$

Updating weights and bias:
$$w_t = w_t - \eta \nabla_w Cost(w_t) \quad (12)$$
$$b = b - \eta \nabla_b Cost(w_t) \quad (13)$$

Minimizing the cost function using Stochastic Gradient Descent (SGD):
$$\min Cost(w_i) = \lambda \lVert w \rVert^2 + \max(0, 1 - y_i f(x_i)) \quad (14)$$
$$\lambda = \frac{1}{NC} \quad (15)$$

3.3.2. Algorithm

The SVM classifier (Support Vector Machine) creates a hyperplane that maximizes the distance to the closest points of the two classes (the support vectors). When creating a hyperplane, two techniques are used: soft margin and hard margin. A soft margin allows the algorithm to make mistakes during training, so points may lie on the wrong side of the hyperplane or inside the margin. A hard margin does not tolerate any errors during training, so points cannot lie on the wrong side of the hyperplane or inside the margin. In our case, where the data is not completely linearly separable, we used the soft margin technique together with various kernels that transform the data into a higher dimensionality, and we also used the regularization parameter C. We created models in which one of the classes is labeled 1 and the other -1. Then, using Stochastic Gradient Descent, we updated the weights w and the bias b after each data sample. Finally, we tested our models: if a predicted value was negative, the sample was assigned the label -1; if non-negative, the label 1. We compared the performance of classifiers using different kernels: linear, poly, rbf, laplacian and sigmoid.
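A minimal sketch of this SGD training loop under Eqs. (11)-(15). The learning rate and epoch count are assumptions, since the paper does not report them.

```python
import numpy as np

def svm_sgd(X, y, C=1.0, eta=0.001, epochs=100):
    """Linear soft-margin SVM trained with SGD, following Eqs. (11)-(15).

    Labels y must be -1/+1; lambda = 1 / (N * C) as in Eq. (15).
    """
    n, m = X.shape
    lam = 1.0 / (n * C)                              # Eq. (15)
    w, b = np.zeros(m), 0.0
    for _ in range(epochs):
        for i in np.random.permutation(n):           # one update per data sample
            margin = y[i] * (X[i] @ w + b)
            if margin >= 1:                           # hinge loss (11) inactive:
                w -= eta * (2 * lam * w)              # only the regularization gradient
            else:                                     # hinge loss active:
                w -= eta * (2 * lam * w - y[i] * X[i])  # Eq. (12)
                b -= eta * (-y[i])                      # Eq. (13)
    return w, b

def svm_predict(X, w, b):
    # Negative scores get label -1, non-negative get +1, as in the paper.
    return np.where(X @ w + b >= 0, 1, -1)
```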
4. Experiments
4.1. KNN
We created the KNN models for the different metrics, with predictions based on the 3 nearest neighbors.
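For reference, an equivalent experiment can be set up with scikit-learn (an assumption; the paper does not name its tooling). The synthetic data below merely stands in for one of the three preprocessing variants of Section 2.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Placeholder data; shapes and labels are assumptions, not the Raisin data.
rng = np.random.default_rng(42)
X = rng.normal(size=(900, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for metric in ["euclidean", "manhattan", "minkowski", "canberra", "chebyshev", "cosine"]:
    extra = {"p": 3} if metric == "minkowski" else {}   # r = 3 as in Section 3.1.2
    knn = KNeighborsClassifier(n_neighbors=3, metric=metric, algorithm="brute", **extra)
    knn.fit(X_train, y_train)
    print(f"{metric:>10}: accuracy = {accuracy_score(y_test, knn.predict(X_test)):.3f}")
```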
Figure 2: Classification reports for KNN with PCA. The results are shown in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine
Figure 3: Classification reports for KNN with two features. The results are shown in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine
Figure 4: Decision boundaries for KNN with two features. The results are presented in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine
Figure 5: Classification reports for KNN with all features. The results are shown in order for the metrics: euclidean, manhattan, minkowski, canberra, chebyshev, cosine
4.2. Gaussian Naive Bayes
Figure 6: Classification reports for Gaussian Naive Bayes, shown in order for PCA and for all features
Figure 7: Classification report for Gaussian Naive Bayes with two features and decision boundaries of this model
4.3. SVM
We created SVM models for different kernels with specific parameters. These kernels are: linear, polynomial with degree 7, rbf with a gamma of 2, laplacian with a gamma of 2 and sigmoid with a gamma of 1.
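An equivalent setup in scikit-learn (again an assumption about tooling) could look as follows. Since laplacian is not a built-in SVC kernel, it is passed as a callable via sklearn.metrics.pairwise.laplacian_kernel; the data is a synthetic placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics.pairwise import laplacian_kernel
from sklearn.metrics import accuracy_score

# Placeholder standardized data; shapes and values are assumptions.
rng = np.random.default_rng(0)
X = rng.normal(size=(900, 7))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "linear": SVC(kernel="linear"),
    "poly": SVC(kernel="poly", degree=7),
    "rbf": SVC(kernel="rbf", gamma=2),
    # laplacian as a callable returning the Gram matrix between A and B
    "laplacian": SVC(kernel=lambda A, B: laplacian_kernel(A, B, gamma=2)),
    "sigmoid": SVC(kernel="sigmoid", gamma=1),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(f"{name:>9}: accuracy = {accuracy_score(y_test, clf.predict(X_test)):.3f}")
```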
Figure 8: Classification reports for SVM with PCA. The results are shown in order for the kernels: linear, poly, rbf, laplacian and sigmoid
Figure 9: Classification reports for SVM with two features. The results are shown in order for the kernels: linear, poly, rbf, laplacian and sigmoid
Figure 10: Classification reports for SVM with all features. The results are shown in order for the kernels: linear, poly, rbf, laplacian, sigmoid
Figure 11: Decision boundaries for SVM with two features
5. Conclusion

After analyzing our results for the three classifiers, we conclude that using the PCA technique to reduce the dimensionality from 7 to 2 features supported the performance of our models, achieving high accuracies comparable to the results of the models built on all features. After analyzing the correlation of our data, we were able to find two features, MajorAxisLength and Eccentricity, for which the models had accuracy similar to the PCA-based models. The accuracies of the KNN classifiers with and without PCA are very similar, ranging from 82% to 88% depending on the metric. In the case of the Gaussian Naive Bayes classifiers, the accuracy obtained using PCA and using all 7 features gave the same value of 85%, which confirms that the reduction in dimensions did not cause a loss of significant information. The last classifier analyzed was SVM. Across the different kernels, the sigmoid kernel turned out to be the best, giving the best accuracy of 88% in the models both with and without PCA.

References

[1] K. Thirunavukkarasu, A. S. Singh, P. Rai, S. Gupta, Classification of iris dataset using classification based knn algorithm in supervised learning, 2018 4th International Conference on Computing Communication and Automation (ICCCA) (2018) 1–4.
[2] G. De Magistris, R. Caprari, G. Castro, S. Russo, L. Iocchi, D. Nardi, C. Napoli, Vision-based holistic scene understanding for context-aware human-robot interaction, 13196 LNAI (2022) 310–325. doi:10.1007/978-3-031-08421-8_21.
[3] V. Amaresh, R. R. Singh, R. Kamal, A. Kulkarni, Linear regression models based housing price forecasting, 2022 International Conference on Industry 4.0 Technology (I4Tech) (2022) 1–5.
[4] W. Feng, J. Zhang, Y. Chen, Z. Qin, Y. Zhang, M. Ahmad, M. Woźniak, Exploiting robust quadratic polynomial hyperchaotic map and pixel fusion strategy for efficient image encryption, Expert Systems with Applications 246 (2024) 123190.
[5] M. Wozniak, C. Napoli, E. Tramontana, G. Capizzi, G. Lo Sciuto, R. K. Nowicki, J. T. Starczewski, A multiscale image compressor with rbfnn and discrete wavelet decomposition, volume 2015-September, 2015. doi:10.1109/IJCNN.2015.7280461.
[6] G. Capizzi, C. Napoli, S. Russo, M. Woźniak, Lessening stress and anxiety-related behaviors by means of ai-driven drones for aromatherapy, volume 2594, 2020, pp. 7–12.
[7] N. Brandizzi, S. Russo, R. Brociek, A. Wajda, First studies to apply the theory of mind theory to green and smart mobility by using gaussian area clustering, volume 3118, 2021, pp. 71–76.
[8] V. Ponzi, S. Russo, A. Wajda, R. Brociek, C. Napoli, Analysis pre and post covid-19 pandemic rorschach test data of using em algorithms and gmm models, volume 3360, 2022, pp. 55–63.
[9] D. P. Hapsari, I. Utoyo, S. W. Purnami, Fractional gradient descent optimizer for linear classifier support vector machine, 2020 Third International Conference on Vocational Education and Electrical Engineering (ICVEE) (2020) 1–5.
[10] G. Capizzi, G. L. Sciuto, C. Napoli, R. Shikler, M. Wozniak, Optimizing the organic solar cell manufacturing process by means of afm measurements and neural networks, Energies 11 (2018). doi:10.3390/en11051221.
[11] M. Woźniak, M. Wieczorek, J. Siłka, Bilstm deep neural network model for imbalanced medical data of iot systems, Future Generation Computer Systems 141 (2023) 489–499.
[12] A. Sikora, A. Zielonka, M. F. Ijaz, M. Woźniak, Digital twin heuristic positioning of insulation in multimodal electric systems, IEEE Transactions on Consumer Electronics (2024).
[13] G. Capizzi, C. Napoli, L. Paternò, An innovative hybrid neuro-wavelet method for reconstruction of missing data in astronomical photometric surveys, 7267 LNAI (2012) 21–29. doi:10.1007/978-3-642-29347-4_3.
[14] Q. Ke, X. Jing, M. Woźniak, S. Xu, Y. Liang, J. Zheng, Apgvae: Adaptive disentangled representation learning with the graph-based structure information, Information Sciences 657 (2024) 119903.
[15] N. A. Al-Sammarraie, Y. M. H. Al-Mayali, Y. A. B. El-Ebiary, Classification and diagnosis using back propagation artificial neural networks (ann), 2018 International Conference on Smart Computing and Electronic Enterprise (ICSCEE) (2018) 1–5.
[16] D. Gangadia, Activation functions: experimentation and comparison, 2021 6th International Conference for Convergence in Technology (I2CT) (2021) 1–6.
[17] P. H. Progga, M. J. Rahman, S. Biswas, M. S. Ahmed, D. M. Farid, K-nearest neighbour classifier for big data mining based on informative instances, 2023 IEEE 8th International Conference for Convergence in Technology (I2CT) (2023) 1–7.
[18] W. Dong, M. Woźniak, J. Wu, W. Li, Z. Bai, Denoising aggregation of graph neural networks by using principal component analysis, IEEE Transactions on Industrial Informatics 19 (2022) 2385–2394.