<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Analysis of the Performance of Naive Bayes and K-Nearest Neighbor Classifiers</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hubert</forename><surname>Bojda</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Mathematics</orgName>
								<orgName type="institution">Silesian University of Technology</orgName>
								<address>
									<addrLine>Kaszubska 23</addrLine>
									<postCode>44100</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">POLAND</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dawid</forename><surname>Gala</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Mathematics</orgName>
								<orgName type="institution">Silesian University of Technology</orgName>
								<address>
									<addrLine>Kaszubska 23</addrLine>
									<postCode>44100</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">POLAND</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Analysis of the Performance of Naive Bayes and K-Nearest Neighbor Classifiers</title>
					</analytic>
					<monogr>
						<title level="m">Information Society and University Studies</title>
						<idno type="ISSN">1613-0073</idno>
						<meeting>
							<address>
								<settlement>Kaunas</settlement>
								<country key="LT">Lithuania</country>
							</address>
						</meeting>
						<imprint>
							<date type="published" when="2024-05-17">May 17, 2024</date>
						</imprint>
					</monogr>
					<idno type="MD5">647F105A8CBF3D211D514CB99CB15E5D</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>artificial intelligence</term>
					<term>London weather data</term>
					<term>dataset</term>
					<term>machine learning algorithms</term>
					<term>K-Nearest Neighbors (KNN)</term>
					<term>Naive Bayes</term>
					<term>accuracy</term>
					<term>F1-score</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In our study, we implemented and compared two machine learning algorithms: K-Nearest Neighbors (KNN) and Naive Bayes. For each algorithm, we conducted 10 test runs to evaluate their performance. The results indicated that the KNN algorithm achieved an accuracy ranging from 0.80 to 0.82, demonstrating its robustness in predicting weather conditions based on London's historical weather data. The Naive Bayes algorithm, on the other hand, achieved an accuracy ranging from 0.74 to 0.76. Although slightly lower than KNN, these results still reflect the Naive Bayes algorithm's effectiveness in handling the weather data. Overall, this analysis provides valuable insights into the predictive capabilities of these algorithms.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Artificial intelligence methods offer many examples and uses of machine learning algorithms. This is increasingly important today, because more and more systems implement AI algorithms of varying sophistication. For example, they can be used in deep neural network models for imbalanced medical data of IoT systems <ref type="bibr" target="#b0">[1]</ref> or to predict the spread of the COVID-19 virus <ref type="bibr" target="#b1">[2]</ref>. This artificial intelligence system was developed to explore and validate the effectiveness of the K-Nearest Neighbors (KNN) and Gaussian Naive Bayes algorithms. To achieve this, we selected a weather database, which is particularly well-suited for testing these algorithms due to its mix of numerical and categorical data. The database includes columns with numerical values such as temperature, humidity, and wind speed, alongside a column containing categorical information about weather conditions at the time of observation, including categories like 'Clear', 'Overcast', and 'Foggy'. This rich and diverse dataset facilitates effective training and testing, enabling a thorough evaluation of the algorithms' performance.</p><p>The numerical data suits the KNN algorithm, which predicts outcomes based on the distance between data points. For KNN, we use the Euclidean distance measure to find the closest neighbors to a given data point and make predictions based on these neighbors. The categorical weather classifications, on the other hand, are well-suited for the Naive Bayes algorithm. Naive Bayes calculates the probability of each class from the feature distributions and assumes that the features are independent given the class label, which makes it efficient for categorical data.</p><p>We then divided the dataset into a 70:30 ratio for training and testing. This split provides a substantial amount of data for training the models while reserving enough data to accurately assess their performance. Our benchmark tests involved evaluating the algorithms using standard performance metrics such as accuracy, precision, recall, and F1-score. These metrics offer a comprehensive view of the algorithms' ability to classify weather conditions correctly. Additionally, we performed cross-validation to ensure that our results were not overly dependent on a particular train-test split, further validating the robustness and reliability of our models.</p></div>
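The split-and-evaluate procedure above can be sketched with scikit-learn. This is an illustrative sketch, not the authors' code: the data here is a synthetic stand-in generated with `make_classification`, and the variable names (`X`, `y`, `model`) are ours, so the scores will not match the paper's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic stand-in for the weather data: numeric features, 3 classes.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# 70:30 train/test split, as in the study.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = KNeighborsClassifier(n_neighbors=6).fit(X_train, y_train)
pred = model.predict(X_test)

# The four metrics named above (macro-averaged over the classes).
acc = accuracy_score(y_test, pred)
prec = precision_score(y_test, pred, average="macro")
rec = recall_score(y_test, pred, average="macro")
f1 = f1_score(y_test, pred, average="macro")

# Cross-validation to check the result is not split-dependent.
cv_scores = cross_val_score(KNeighborsClassifier(n_neighbors=6), X, y, cv=5)
print(acc, prec, rec, f1, cv_scores.mean())
```

Macro averaging is one of several choices `average` accepts; the paper does not state which one was used.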
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">K Nearest Neighbors</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1.">Description</head><p>The KNN classifier <ref type="bibr" target="#b2">[3]</ref>, or k-nearest neighbor algorithm, classifies and predicts the value of the variable specified in the decision column of the database. The algorithm compares the values of the columns that describe the phenomenon with the values of the corresponding variables in the learning set, and the prediction is based on the k closest observations from the learning set.</p><p>An important aspect in the creation of the classifier is the selection of an appropriate metric that calculates the distance between a new observation and the observations of the training set. The most popular metrics are the Euclidean, Minkowski, and Manhattan distances.</p><p>With successive iterations, the division of the data is corrected with respect to the given metric. The algorithm moves data between classes so that the variance within each class is as small as possible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2.">Formulas</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Calculating Distance Between Points:</head><p>The Euclidean distance 𝑑 between two points 𝑥 𝑖 = (𝑥 𝑖1 , 𝑥 𝑖2 , . . . , 𝑥 𝑖𝑛 ) and 𝑥 𝑗 = (𝑥 𝑗1 , 𝑥 𝑗2 , . . . , 𝑥 𝑗𝑛 ) is given by:</p><formula>𝑑(𝑥 𝑖 , 𝑥 𝑗 ) = √( ∑_{𝑘=1}^{𝑛} (𝑥 𝑖𝑘 − 𝑥 𝑗𝑘 )² )<label>(1)</label></formula></div>
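As a quick sanity check, the Euclidean distance defined above can be computed directly (a minimal sketch; the function name is ours):

```python
import math

def euclidean(xi, xj):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

print(euclidean((0, 0), (3, 4)))  # → 5.0
```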
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Finding Nearest Neighbors</head><p>To find the 𝑘 nearest neighbors for a test point, compute the Euclidean distances from the test point to all points in the training set and select the 𝑘 points with the smallest distances.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Classification by Majority Voting</head><p>For classification, the class of the test point is determined by the classes of its 𝑘 nearest neighbors. The class 𝐶 of the test point is given by:</p><formula>𝐶 = arg max_𝑐 ∑_{𝑖=1}^{𝑘} 1(𝑦 𝑖 = 𝑐)<label>(2)</label></formula><p>where 1(𝑦 𝑖 = 𝑐) is an indicator function that equals 1 if 𝑦 𝑖 = 𝑐 and 0 otherwise.</p></div>
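Majority voting over the neighbors' labels can be done with `collections.Counter`; a minimal illustration (function name is ours):

```python
from collections import Counter

def majority_vote(neighbor_labels):
    # The most frequent class among the k nearest neighbors' labels.
    return Counter(neighbor_labels).most_common(1)[0][0]

print(majority_vote(["rain", "clear", "rain"]))  # → rain
```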
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.3.">Classifier Algorithm</head><p>The KNN classifier algorithm is shown below:</p><formula xml:id="formula_0">Algorithm 1 KNN Algorithm
Require: 𝑋_𝑡𝑟𝑎𝑖𝑛, 𝑦_𝑡𝑟𝑎𝑖𝑛, 𝑋_𝑡𝑒𝑠𝑡, 𝑦_𝑡𝑒𝑠𝑡
1: 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠 ← []
2: for 𝑥 in 𝑋_𝑡𝑒𝑠𝑡 do
3: Calculate and sort distances from 𝑥 to 𝑋_𝑡𝑟𝑎𝑖𝑛.
4: Select 𝑘 nearest neighbors' labels.
5: Perform majority voting to determine the most frequent label.
6: Add the most frequent label to 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠.
7: Calculate the accuracy by comparing 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠 to 𝑦_𝑡𝑒𝑠𝑡.</formula></div>
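Algorithm 1 translates almost line-for-line into Python. This is an illustrative from-scratch sketch with our own names, not the authors' implementation:

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=6):
    """From-scratch KNN following Algorithm 1."""
    predictions = []
    for x in X_test:
        # Step 3: calculate and sort distances from x to X_train.
        dists = sorted(
            (math.dist(x, xt), label) for xt, label in zip(X_train, y_train)
        )
        # Step 4: select the k nearest neighbors' labels.
        labels = [label for _, label in dists[:k]]
        # Steps 5-6: majority vote, append the most frequent label.
        predictions.append(Counter(labels).most_common(1)[0][0])
    return predictions

def accuracy(predictions, y_test):
    # Step 7: compare predictions to y_test.
    return sum(p == y for p, y in zip(predictions, y_test)) / len(y_test)
```

A usage example on a toy set: `knn_predict([(0, 0), (0, 1), (5, 5), (6, 5)], ["a", "a", "b", "b"], [(0, 0.5), (5, 6)], k=3)` returns one label per test point.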
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Naive Bayes</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1.">Description</head><p>Before describing Gaussian Naive Bayes, we would like to describe how the Naive Bayes algorithm works. Consider classifying a fruit as an apple from its color, roundness, and diameter: a naive Bayes classifier considers each of these features to contribute independently to the probability that the fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features. Based on prior knowledge of conditions that may be related to an event, Bayes' theorem describes the probability of the event.</p><p>So what is Gaussian Naive Bayes? <ref type="bibr" target="#b3">[4]</ref> Gaussian Naive Bayes is a variant of the Naive Bayes method in which continuous attributes are considered and the data features are assumed to follow a Gaussian distribution throughout the dataset. In the terminology of the Sklearn library, Gaussian Naive Bayes is a classification algorithm, based on Naive Bayes, that works on continuous, normally distributed features. The Naive Bayes classifier is based on Bayes' theorem and the assumption of conditional independence of features. The formula is as follows:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.2.">Formulas</head><formula xml:id="formula_1">𝑃 (𝐶 𝑘 |x) = 𝑃 (𝐶 𝑘 ) • 𝑃 (x|𝐶 𝑘 ) / 𝑃 (x)<label>(3)</label></formula><p>where:</p><p>• 𝑃 (𝐶 𝑘 |x) is the posterior probability of class 𝐶 𝑘 given the sample x,</p><p>• 𝑃 (𝐶 𝑘 ) is the prior probability of class 𝐶 𝑘 ,</p><p>• 𝑃 (x|𝐶 𝑘 ) is the likelihood of sample x given class 𝐶 𝑘 ,</p><p>• 𝑃 (x) is the total probability of the sample x.</p><p>Assuming conditional independence of the features x = (𝑥 1 , 𝑥 2 , . . . , 𝑥 𝑛 ), we can write:</p><formula xml:id="formula_2">𝑃 (x|𝐶 𝑘 ) = 𝑃 (𝑥 1 , . . . , 𝑥 𝑛 |𝐶 𝑘 ) = ∏_{𝑖=1}^{𝑛} 𝑃 (𝑥 𝑖 |𝐶 𝑘 )<label>(4)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Therefore, the final formula for the Naive Bayes classifier is:</p><formula xml:id="formula_3">𝑃 (𝐶 𝑘 |x) ∝ 𝑃 (𝐶 𝑘 ) • ∏_{𝑖=1}^{𝑛} 𝑃 (𝑥 𝑖 |𝐶 𝑘 )<label>(5)</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.3.">Classifier Algorithm</head><p>The Naive Bayes classifier algorithm is shown below:</p><formula>Algorithm 2 Description of the Naive Bayes Algorithm
Require: 𝑋_𝑡𝑟𝑎𝑖𝑛, 𝑦_𝑡𝑟𝑎𝑖𝑛, 𝑋_𝑡𝑒𝑠𝑡
1: 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠 ← []
2: Calculate the prior probabilities for each class using 𝑦_𝑡𝑟𝑎𝑖𝑛.
3: Calculate the mean and variance of each feature for each class using 𝑋_𝑡𝑟𝑎𝑖𝑛 and 𝑦_𝑡𝑟𝑎𝑖𝑛.
4: for each point 𝑥 in 𝑋_𝑡𝑒𝑠𝑡 do
5: Calculate the likelihood of 𝑥 for each class using the Gaussian probability density function.
6: Calculate posterior probabilities for each class based on the features of point 𝑥.
7: Select the class with the highest posterior probability as the predicted label for point 𝑥.
8: Add the predicted label to the predictions list.
9: end for
10: return 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑠</formula></div>
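The Gaussian Naive Bayes steps can be sketched from scratch as follows. This is an illustrative sketch under the Gaussian assumption; the function and variable names are ours, not the authors' code, and the likelihoods are accumulated in log space for numerical stability:

```python
import math
from collections import defaultdict

def gnb_fit(X_train, y_train):
    """Class priors plus per-class feature means and variances."""
    by_class = defaultdict(list)
    for x, y in zip(X_train, y_train):
        by_class[y].append(x)
    priors, stats = {}, {}
    for c, rows in by_class.items():
        priors[c] = len(rows) / len(X_train)
        feats = []
        for col in zip(*rows):  # one column per feature
            m = sum(col) / len(col)
            v = sum((val - m) ** 2 for val in col) / len(col) + 1e-9  # smoothed variance
            feats.append((m, v))
        stats[c] = feats
    return priors, stats

def log_gaussian(x, mean, var):
    # Log of the Gaussian probability density function.
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def gnb_predict(priors, stats, X_test):
    predictions = []
    for x in X_test:
        # Posterior ∝ prior · product of per-feature likelihoods (Eq. 5),
        # computed as a sum of logs; pick the class with the highest value.
        posteriors = {
            c: math.log(priors[c])
               + sum(log_gaussian(v, m, s) for v, (m, s) in zip(x, stats[c]))
            for c in priors
        }
        predictions.append(max(posteriors, key=posteriors.get))
    return predictions
```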
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Dataset preparation</head><p>The Weather Dataset <ref type="bibr" target="#b4">[5]</ref>, extracted by MUTHUKUMAR.J, contains data from the years 1979 to 2021. Records that did not meet the following dependency have been removed from the original database:</p><formula xml:id="formula_4">• Formatted Date • Apparent Temperature (C) • Precip Type • Loud Cover • Daily Summary • Humidity • Wind Bearing (degrees)</formula><p>In the first phase of testing, we worked on three abstract classes: 'rain,' 'clear,' and 'overcast.' The accuracy of the classifiers was around 90% for KNN and 80% for Naive Bayes. However, the confusion matrices revealed that the count of entities labeled 'rain' in the 'Summary' column was very low. Consequently, the next step was to identify the abstract classes with the highest count of entities. To address this, we analyzed the distribution of the 'Summary' column values across the different classes. This analysis helped us determine which classes had the highest representation, allowing us to focus our efforts on balancing the dataset and improving the overall performance of the classifiers. Based on this, records that do not belong to one of these abstract classes were removed, as the remaining classes had a negative impact <ref type="bibr" target="#b5">[6]</ref> on the model's performance. Examples of removed classes: "Breezy and Mostly Cloudy", "Windy and Foggy", "Windy and Dry", "Dry and Partly Cloudy".</p></div>
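The preparation step can be sketched with pandas. This is a hypothetical sketch: the tiny in-memory frame stands in for the Kaggle CSV, and the dropped columns and kept classes shown are only a subset of those the paper lists:

```python
import pandas as pd

# Toy stand-in for the Weather Dataset; column names follow the Kaggle file.
df = pd.DataFrame({
    "Summary": ["Clear", "Overcast", "Windy and Dry", "Foggy"],
    "Temperature (C)": [21.0, 15.2, 10.1, 8.4],
    "Wind Speed (km/h)": [5.0, 12.3, 30.1, 3.2],
    "Loud Cover": [0.0, 0.0, 0.0, 0.0],
    "Daily Summary": ["sunny", "cloudy", "windy", "fog"],
})

# Drop unneeded columns (here only those present in this toy frame).
df = df.drop(columns=["Loud Cover", "Daily Summary"])

# Keep only the dominant abstract classes; rare mixed labels such as
# "Windy and Dry" hurt the model, as noted above.
kept = ["Clear", "Overcast", "Foggy"]
df = df[df["Summary"].isin(kept)]
print(df["Summary"].value_counts())
```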
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Tests</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1.">KNN tests</head><p>The first phase of testing involved appropriately reducing the number of classes in the project's dataset to decrease the computational complexity of the model. After data preparation, model testing commenced. Next, the optimal value of 𝑘 for the model was determined. The Matplotlib library <ref type="bibr" target="#b6">[7]</ref>, which generates graphs, was helpful in this regard. In Figure <ref type="figure" target="#fig_1">1</ref>, we observe that our model performs best for 𝑘 = 6. However, in the interval [1, 10], the values exhibit significant variability, with stabilization occurring only in the interval (10, 30).</p></div>
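The 𝑘 sweep behind Figure 1 can be sketched as follows. Since the data here is synthetic stand-in data, the best 𝑘 found will generally differ from the paper's 𝑘 = 6:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared weather data.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)

# Evaluate test accuracy for each candidate k in [1, 30].
scores = {}
for k in range(1, 31):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = clf.score(X_test, y_test)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
# A matplotlib line plot of scores.keys() vs. scores.values()
# reproduces the shape of Figure 1.
```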
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2.">Naive Bayes tests</head><p>The conditions under which the Bayes classifier has a good distribution are discussed alongside Figure <ref type="figure" target="#fig_2">2</ref>. The k-nearest neighbor classifier has about 8 percentage points higher accuracy than the Gaussian Naive Bayes classifier in fig. <ref type="figure" target="#fig_5">5</ref>. This difference can be attributed to several factors. First, KNN is a nonparametric algorithm, meaning that it does not assume any particular distribution of the data. This flexibility allows it to effectively capture complex, nonlinear relationships in the feature space. GNB, on the other hand, assumes that the features have a Gaussian distribution and are independent given the class label. When the actual data distribution deviates from these assumptions, GNB's performance can suffer. Second, KNN relies on the proximity of data points in the feature space, adapting well to different data distributions without making strong assumptions. In addition, KNN can mitigate the impact of outliers and noisy data by considering multiple nearest neighbors, which helps smooth out the influence of anomalous data points. GNB, in contrast, can be inaccurate under the significant influence of outliers, as they can distort the estimation of the mean and variance parameters of the Gaussian distribution for each feature. Together, these factors contribute to the higher accuracy we observed for KNN in our tests. From the time comparison in fig. <ref type="figure" target="#fig_6">6</ref>, the classifiers have completely different execution times. The KNN algorithm can be more time-consuming, especially for large datasets, due to the need to calculate the distance between each test point and every point in the training set. Naive Bayes, being based on a simple probabilistic model, often exhibits lower computational complexity. In addition, differences in running times may also be due to differences in the implementations of these algorithms and the characteristics of the specific data, such as the number of dimensions or the size of the dataset. The F1-score, or F1-measure, is a measure of predictive performance. It is calculated from the precision and recall of the test, where precision is the number of true positive results divided by the number of all samples predicted to be positive, including those not identified correctly, and recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification. Using the built-in F1 metric from the sklearn library <ref type="bibr" target="#b8">[9]</ref>, nine iterations were conducted with data shuffling to calculate the results. This approach ensured robustness in evaluating the model's performance across multiple trials and varying data distributions. Each iteration involved computing the F1-score, which provides a balanced measure of the classifier's precision and recall, thus capturing its ability to correctly classify positive instances while minimizing false positives and false negatives. The iterative process allowed for a comprehensive assessment of the model's effectiveness in handling different data configurations and revealed insights into its consistency and reliability. As seen in fig. <ref type="figure" target="#fig_7">7</ref>, the results obtained by the F1-score from the sklearn library closely align with the results obtained using the accuracy calculation algorithm implemented by the authors in the tested classifier.</p></div>
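The nine reshuffled F1 evaluations can be sketched with `sklearn.metrics.f1_score` [9]. Synthetic stand-in data is used again, so the scores only illustrate the procedure, not the paper's figures:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for the prepared weather data.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=2)

f1_runs = []
for seed in range(9):  # nine iterations with data shuffling
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, shuffle=True, random_state=seed)
    pred = KNeighborsClassifier(n_neighbors=6).fit(X_tr, y_tr).predict(X_te)
    f1_runs.append(f1_score(y_te, pred, average="macro"))

print(min(f1_runs), max(f1_runs))
```

Plotting `f1_runs` per iteration reproduces the shape of Figure 7.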
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1.">F1-Score</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>To sum up and recap, our study using the London weather dataset provided valuable insights into the functioning and performance of K-Nearest Neighbors (KNN) and the Gaussian Naive Bayes classifier (GNB). KNN is much easier to implement. Its concept is straightforward: it classifies new data points based on the most common class among the nearest neighbors. This simplicity of implementation makes KNN an attractive option for quick and easy classification tasks. However, it has its limitations. KNN can be slower to classify, especially for large datasets, because it is necessary to calculate the distance between a new point and each point in the training set. This distance calculation can become computationally expensive as the size of the dataset increases, leading to longer classification times. On the other hand, the Bayes classifier, particularly the Naive Bayes classifier, may require more effort at the implementation stage. This is due to the need to calculate and model conditional probabilities and to make feature independence assumptions. Despite this initial complexity, the Naive Bayes classifier can be faster during the classification phase. It only requires calculating the conditional probabilities for each feature and applying Bayes' rule. Throughout this study, we gained a great deal of knowledge. Implementing these algorithms and benchmarking their performance allowed us to gain practical experience with both KNN and GNB. We discovered firsthand the trade-offs between ease of implementation and computational efficiency.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Choosing the best k value</figDesc><graphic coords="4,153.60,397.70,286.10,208.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Cumulative Distribution Function for Naive Bayes Classifier</figDesc><graphic coords="5,155.95,86.95,286.70,199.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>•</head><label></label><figDesc>The lines for each class increase rapidly, indicating high probabilities assigned by the model to the correct classes. • Lines for different classes should be separated from each other, indicating that the model distinguishes classes well. • The CDF lines should be close to zero at low probabilities As you can see from fig.2, all of these things are almost maintained, indicating that the classifier predicts quite well. We can confirm this because the classifier has an accuracy of about 75%. It is worth noting on the sudden intersection of the foggy class. The abrupt intersection of the line indicates that the model is uncertain about assigning probabilities to this particular class, which may be the result of an overlap in feature space between this class and other classes.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 : 4 :</head><label>34</label><figDesc>Figure 3: Confusion Matrix for K Nearest Neighbors Figure 4: Confusion Matrix for Naive Bayes</figDesc><graphic coords="5,70.10,557.20,190.10,151.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Comparison of the classifier accuracies</figDesc><graphic coords="6,148.90,145.90,298.90,174.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Comparison of the classifier times</figDesc><graphic coords="6,148.90,532.60,298.90,174.35" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: F1 Score</figDesc></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">BiLSTM deep neural network model for imbalanced medical data of IoT systems</title>
		<author>
			<persName><forename type="first">Marcin</forename><surname>Woźniak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michał</forename><surname>Wieczorek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jakub</forename><surname>Siłka</surname></persName>
		</author>
		<ptr target="https://www.sciencedirect.com/science/article/pii/S0167739X22004095" />
	</analytic>
	<monogr>
		<title level="j">Future Generation Computer Systems</title>
		<imprint>
			<biblScope unit="volume">141</biblScope>
			<biblScope unit="page" from="489" to="499" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Neural network powered COVID-19 spread forecasting model</title>
		<author>
			<persName><forename type="first">Michał</forename><surname>Wieczorek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jakub</forename><surname>Siłka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marcin</forename><surname>Woźniak</surname></persName>
		</author>
		<ptr target="https://www.sciencedirect.com/science/article/pii/S0960077920305993" />
	</analytic>
	<monogr>
		<title level="j">Chaos, Solitons &amp; Fractals</title>
		<idno type="ISSN">0960-0779</idno>
		<imprint>
			<biblScope unit="volume">140</biblScope>
			<biblScope unit="page">110203</biblScope>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">K-Nearest Neighbor(KNN) Algorithm in Machine Learning</title>
		<author>
			<persName><forename type="first">Rizwana</forename><surname>Yasmeen</surname></persName>
		</author>
		<ptr target="https://medium.com/@rizwanayasmeen06/k-nearest-neighbor-knn-algorithm-in-machine-learning-d38d9638d7e0" />
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Intrusion Detection System using Naive Bayes algorithm</title>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">S</forename><surname>Sharmila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nagapadma</surname></persName>
		</author>
		<ptr target="https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9019921%5C&amp;casa_token=FEZMEU72iF8AAAAA:4D1LZ_1ZcT7dqbDdxFSbDGfqnG8TMb-vwrGeDgnZRzxV7YMyJGNupv8dmhmhkpsq2C6SJqZAmxc" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Weather Dataset</title>
		<author>
			<persName><forename type="middle">J</forename><surname>Muthukumar</surname></persName>
		</author>
		<ptr target="https://www.kaggle.com/datasets/muthuj7/weather-dataset" />
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page">1</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Assessing the Impact of Changing Environments on Classifier Performance</title>
		<author>
			<persName><forename type="first">Rocio</forename><surname>Alaiz-Rodriguez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nathalie</forename><surname>Japkowicz</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-540-68825-9_2</idno>
		<ptr target="https://link.springer.com/chapter/10.1007/978-3-540-68825-9_2" />
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Matplotlib for Python Developers</title>
		<author>
			<persName><forename type="first">Sandro</forename><surname>Tosi</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Uczenie maszynowe z użyciem Scikit-Learn i TensorFlow, Wydanie II [Machine Learning with Scikit-Learn and TensorFlow, 2nd Edition, updated to TensorFlow 2]</title>
		<author>
			<persName><forename type="first">Aurélien</forename><surname>Géron</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">2</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<ptr target="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html" />
		<title level="m">sklearn.metrics.f1_score</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>scikit-learn</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
