<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysis of the Performance of Naive Bayes and K-Nearest Neighbor Classifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hubert Bojda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dawid Gala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>Kaszubska 23, 44100 Gliwice</addr-line>
          ,
          <country country="PL">POLAND</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IVUS2024: Information Society and University Studies 2024</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In our study, we implemented and compared two machine learning algorithms: K-Nearest Neighbors (KNN) and Naive Bayes. For each algorithm, we conducted 10 test runs to evaluate their performance. The results indicated that the KNN algorithm achieved an accuracy ranging from 0.80 to 0.82, demonstrating its robustness in predicting weather conditions based on London's historical weather data. The Naive Bayes algorithm, on the other hand, achieved an accuracy ranging from 0.74 to 0.76. Although slightly lower than KNN, these results still reflect the Naive Bayes algorithm's effectiveness in handling the weather data. Overall, this analysis provides valuable insights into the predictive capabilities of these algorithms.</p>
      </abstract>
      <kwd-group>
        <kwd>artificial intelligence</kwd>
        <kwd>London weather data</kwd>
        <kwd>dataset</kwd>
        <kwd>machine learning algorithms</kwd>
        <kwd>K-Nearest Neighbors (KNN)</kwd>
        <kwd>Naive Bayes</kwd>
        <kwd>accuracy</kwd>
        <kwd>F1-score</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Artificial intelligence methods illustrate the range of uses of machine learning algorithms. This is very
important in today's world, because more and more systems have more or less developed AI algorithms
implemented. For example, they can be used in deep neural network models for imbalanced medical data of IoT
systems [1] or to predict the spread of the COVID-19 virus [2]. This artificial intelligence system was developed to
explore and validate the effectiveness of the K-Nearest Neighbors (KNN) and Gaussian Naive Bayes algorithms. To
achieve this, we selected a weather database, which is particularly well-suited for testing these algorithms due to
its mix of numerical and categorical data. The database includes columns with numerical values such as
temperature, humidity, and wind speed, alongside a column containing categorical information about the weather
conditions at the time of observation, including categories like 'Clear', 'Overcast', and 'Foggy'. This rich and
diverse dataset facilitates effective training and testing, enabling a thorough evaluation of the algorithms'
performance.</p>
      <p>The numerical data is well suited to the KNN algorithm, which predicts outcomes based on the distance between
data points. For KNN, we use the Euclidean distance measure to find the closest neighbors of a given data point
and make predictions based on these neighbors. On the other hand, the categorical weather classifications
are well-suited for the Naive Bayes algorithm. Naive Bayes works by calculating the probability of each class
based on the feature distributions and assumes that the features are independent given the class label, making it
efficient for categorical data.</p>
      <p>We then divided the dataset into a 70:30 ratio for training and testing. This split provides a substantial
amount of data for training the models while reserving enough data to accurately assess their performance. Our
benchmark tests involved evaluating the algorithms using standard performance metrics such as accuracy, precision,
recall, and F1-score. These metrics offer a comprehensive view of the algorithms’ ability to classify weather
conditions correctly. Additionally, we performed cross-validation to ensure that our results were not overly
dependent on a particular train-test split, further validating the robustness and reliability of our models.</p>
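      <p>A minimal sketch of this evaluation pipeline is shown below. It uses scikit-learn for the split, the metrics and the cross-validation; the file name and column names are assumptions made for illustration, and scikit-learn's GaussianNB stands in for the classifiers implemented by the authors.</p>
      <preformat>
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the weather dataset (file name and column names assumed for illustration).
data = pd.read_csv("weatherHistory.csv")
X = data[["Temperature (C)", "Humidity", "Wind Speed (km/h)"]]
y = data["Summary"]

# 70:30 train-test split, as used in the study.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a stand-in classifier and compute the standard performance metrics.
model = GaussianNB().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_test, y_pred, average="macro", zero_division=0))
print("F1-score :", f1_score(y_test, y_pred, average="macro"))

# Cross-validation, to check that the results do not depend on one particular split.
print("5-fold CV accuracy:", cross_val_score(GaussianNB(), X, y, cv=5).mean())
      </preformat>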
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <sec id="sec-2-1">
        <title>2.1. K-Nearest Neighbors</title>
        <sec id="sec-2-1-1">
          <title>2.1.1. Description</title>
          <p>The KNN classifier [3], or k-nearest neighbors algorithm, is used to classify and predict the value of the
variable specified in the decision column of the database. The algorithm compares the values of the columns
that explain the phenomenon for a new observation with the values of the same variables in the learning set,
and bases its prediction on the k closest observations from the learning set.</p>
          <p>An important aspect in the creation of a classifier is the selection of an appropriate metric that calculates the
distance between the observations of the test set and those of the training set. The most popular metrics are the
Euclidean, Minkowski and Manhattan distances.</p>
          <p>With successive iterations, the division of the data is corrected against the given metric. The algorithm
moves data between classes so that the variance within each class is as small as possible.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Formulas</title>
          <p>1. Calculating the distance between points. The Euclidean distance d(x, y) between two points
x = (x_1, x_2, \ldots, x_n) and y = (y_1, y_2, \ldots, y_n) is given by:</p>
          <p>d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}   (1)</p>
          <p>2. Finding nearest neighbors. To find the k nearest neighbors of a test point, compute the Euclidean distances
from the test point to all points in the training set and select the k points with the smallest distances.</p>
          <p>3. Classification by majority voting. For classification, the class of the test point is determined by the classes
of its k nearest neighbors. The class \hat{y} of the test point is given by:</p>
          <p>\hat{y} = \arg\max_{c} \sum_{i=1}^{k} 1(y_i = c)   (2)</p>
          <p>where 1(y_i = c) is an indicator function that equals 1 if y_i = c and 0 otherwise, and y_1, \ldots, y_k are the
labels of the k nearest neighbors.</p>
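          <p>As a concrete illustration of formulas (1) and (2), the short sketch below computes the Euclidean distance between two points and performs the majority vote; the numerical values and labels are made up for illustration.</p>
          <preformat>
import numpy as np
from collections import Counter

# Two sample feature vectors (made-up values: temperature, humidity, wind speed).
x = np.array([12.5, 0.81, 14.2])
y = np.array([11.9, 0.78, 10.6])

# Formula (1): Euclidean distance.
d = np.sqrt(np.sum((x - y) ** 2))
print("d(x, y) =", d)

# Formula (2): majority vote over the labels of the k nearest neighbors.
neighbor_labels = ["Clear", "Overcast", "Clear", "Clear", "Rain", "Overcast"]
predicted = Counter(neighbor_labels).most_common(1)[0][0]
print("predicted class:", predicted)  # -> "Clear"
          </preformat>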
        </sec>
        <sec id="sec-2-1-3">
          <title>2.1.3. Classifier Algorithm</title>
          <p>The KNN classifier algorithm is shown below:
Algorithm 1 KNN Algorithm
Require: train_data, train_labels, test_data, test_labels
1: predictions ← []
2: for x in test_data do
3:   Calculate and sort the distances from x to all points in train_data.
4:   Select the labels of the k nearest neighbors.
5:   Perform majority voting to determine the most frequent label.
6:   Add the most frequent label to predictions.
7: end for
8: Calculate the accuracy by comparing predictions to test_labels.
9: return predictions</p>
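          <p>A minimal Python sketch of this algorithm is shown below. It mirrors the pseudocode above, but the function and variable names are our own and it is meant only as an illustration, not as the exact code used in the experiments.</p>
          <preformat>
import numpy as np
from collections import Counter

def knn_predict(train_data, train_labels, test_data, k=6):
    """Predict a label for every row of test_data by majority vote
    among its k nearest neighbors (Euclidean distance)."""
    predictions = []
    for x in test_data:
        # Distances from x to every training point (formula (1)).
        distances = np.sqrt(np.sum((train_data - x) ** 2, axis=1))
        # Labels of the k closest training points.
        nearest = [train_labels[i] for i in np.argsort(distances)[:k]]
        # Majority vote (formula (2)).
        predictions.append(Counter(nearest).most_common(1)[0][0])
    return predictions

def accuracy(predictions, test_labels):
    """Fraction of test points whose predicted label matches the true label."""
    return np.mean(np.array(predictions) == np.array(test_labels))
          </preformat>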
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Naive Bayes</title>
        <sec id="sec-2-2-1">
          <title>2.2.1. Description</title>
          <p>Before describing Gaussian Naive Bayes, we would like to describe how the Naive Bayes algorithm works.
Consider, for example, classifying a fruit as an apple from its color, roundness and diameter: a naive Bayes
classifier considers each of these features to contribute independently to the probability that this
fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter
features. Based on prior knowledge of conditions that may be related to an event, Bayes' theorem describes the
probability of the event.</p>
          <p>So what is Gaussian Naive Bayes? [4] Gaussian Naive Bayes is a type of Naive Bayes method in which continuous
attributes are considered and the data features are assumed to follow a Gaussian distribution throughout the dataset.
In Sklearn library terminology, Gaussian Naive Bayes is a classification algorithm working on continuous, normally
distributed features that is based on the Naive Bayes algorithm. Before diving deeper into this topic we must gain a
basic understanding of the principles on which Gaussian Naive Bayes works. The Naive Bayes classifier is based on
Bayes' theorem and the assumption of conditional independence of features. The formula is as follows:</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. Formulas</title>
          <p>P(C_k \mid \mathbf{x}) = \frac{P(C_k) \cdot P(\mathbf{x} \mid C_k)}{P(\mathbf{x})}   (3)</p>
          <p>Where:
• P(C_k \mid \mathbf{x}) is the posterior probability of class C_k given the sample \mathbf{x},
• P(C_k) is the prior probability of class C_k,
• P(\mathbf{x} \mid C_k) is the likelihood of sample \mathbf{x} given class C_k,
• P(\mathbf{x}) is the total probability of the sample \mathbf{x}.</p>
          <p>Assuming conditional independence of the features \mathbf{x} = (x_1, x_2, \ldots, x_n), we can write:
P(\mathbf{x} \mid C_k) = P(x_1, \ldots, x_n \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)   (4)
Therefore, the final formula for the Naive Bayes classifier is:
P(C_k \mid \mathbf{x}) \propto P(C_k) \cdot \prod_{i=1}^{n} P(x_i \mid C_k)   (5)</p>
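          <p>Since the features here are continuous, each per-feature likelihood P(x_i \mid C_k) is modeled with a Gaussian density whose mean and variance are estimated from the training samples of class C_k. A short sketch of this computation is shown below; the function names are ours and it is illustrative only.</p>
          <preformat>
import numpy as np

def gaussian_pdf(x, mean, var):
    """Gaussian density used as the per-feature likelihood P(x_i | C_k)."""
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

def log_posterior(x, prior, means, variances):
    """Unnormalized log-posterior log P(C_k) + sum_i log P(x_i | C_k) for one class,
    following formula (5); x, means and variances are feature-wise NumPy arrays."""
    return np.log(prior) + np.sum(np.log(gaussian_pdf(x, means, variances)))
          </preformat>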
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.3. Classifier Algorithm</title>
          <p>The Naive Bayes classifier algorithm is shown below:
Algorithm 2 Description of the Naive Bayes Algorithm
Require: train_data, train_labels, test_data
1: predictions ← []
2: Calculate the prior probabilities for each class using train_labels.
3: Calculate the mean and variance of each feature for each class using train_data and train_labels.
4: for x in each point in test_data do
5:   Calculate the likelihood of x for each class using the Gaussian probability density function.
6:   Calculate the posterior probabilities for each class based on the features of point x.
7:   Select the class with the highest posterior probability as the predicted label for point x.
8:   Add the predicted label to the predictions list.
9: end for
10: return predictions</p>
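          <p>A minimal Python sketch following this pseudocode is shown below. As with the KNN sketch, the names and structure are our own and it only illustrates the steps; it is not the exact code used in the study. The logarithm of the posterior is used for numerical stability, which does not change which class attains the maximum.</p>
          <preformat>
import numpy as np

def gnb_fit(train_data, train_labels):
    """Estimate the prior, per-class feature means and variances (steps 2-3)."""
    params = {}
    for c in np.unique(train_labels):
        rows = train_data[train_labels == c]
        params[c] = (len(rows) / len(train_data),   # prior P(C_k)
                     rows.mean(axis=0),             # feature means
                     rows.var(axis=0) + 1e-9)       # feature variances (smoothed)
    return params

def gnb_predict(params, test_data):
    """Pick the class with the highest posterior for every test point (steps 4-9)."""
    predictions = []
    for x in test_data:
        posteriors = {}
        for c, (prior, mean, var) in params.items():
            likelihood = np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
            posteriors[c] = np.log(prior) + np.sum(np.log(likelihood))
        predictions.append(max(posteriors, key=posteriors.get))
    return predictions
          </preformat>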
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset preparation</title>
        <p>The Weather Dataset [5] contains data from the years 1979 to 2021, extracted by MUTHUKUMAR.J. Records that did not
meet the following dependency were removed from the original database:
In the first phase of testing, we worked on three abstract classes: 'rain', 'clear' and 'overcast'. The accuracy of the
classifiers was around 90% for KNN and 80% for Naive Bayes. However, the confusion matrices revealed that the
count of entities labeled 'rain' in the 'Summary' column was very low. Consequently, the next step was to
identify the abstract class with the highest count of entities. To address this, we analyzed the distribution of
the 'Summary' column values across the different classes. This analysis helped us determine which class had
the highest representation, allowing us to focus our efforts on balancing the dataset and improving the overall
performance of the classifiers. Based on this, records whose abstraction class was not among those retained were removed.
These other classes had a negative impact [6] on the model's performance, for example: "Breezy and Mostly
Cloudy", "Windy and Foggy", "Windy and Dry" and "Dry and Partly Cloudy".</p>
      </sec>
      <sec id="sec-3-2a">
        <title>3.2. Tests</title>
        <sec id="sec-3-2a-1">
          <title>3.2.1. KNN tests</title>
          <p>The first phase of testing involved appropriately reducing the number of classes in the project's dataset to
decrease the computational complexity of the model. After data preparation, model testing commenced. Next, the
optimal value of k for the model was determined. The Matplotlib library [7], which generates graphs, was helpful
in this regard. In Figure 1, we observe that our model performs best for k = 6. However, in the interval [1, 10], the
values exhibit significant variability, with stabilization occurring only in the interval (10, 30).</p>
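          <p>The search for the optimal k can be sketched as follows. The snippet assumes train_data, train_labels, test_data and test_labels are NumPy arrays prepared as described above, and reuses the knn_predict and accuracy sketches from Section 2.1.3; the plotting details of Figure 1 are our own assumptions.</p>
          <preformat>
import matplotlib.pyplot as plt

ks = range(1, 31)
scores = []
for k in ks:
    predictions = knn_predict(train_data, train_labels, test_data, k=k)
    scores.append(accuracy(predictions, test_labels))

# Plot accuracy against k, in the spirit of Figure 1.
plt.plot(list(ks), scores, marker="o")
plt.xlabel("k (number of neighbors)")
plt.ylabel("accuracy")
plt.title("KNN accuracy for different values of k")
plt.show()
          </preformat>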
        <sec id="sec-3-1-1">
          <title>3.2.2. Naive Bayes tests</title>
          <p>The Bayes classifier has a good distribution when:
• The lines for each class increase rapidly, indicating high probabilities assigned by the model to the
correct classes.
• Lines for different classes should be separated from each other, indicating that the model distinguishes
classes well.</p>
          <p>• The CDF lines should be close to zero at low probabilities.
As can be seen from Fig. 2, all of these conditions are largely satisfied, indicating that the classifier predicts quite
well. We can confirm this because the classifier has an accuracy of about 75%. It is worth noting the sudden
intersection of the 'foggy' class line. The abrupt intersection of the line indicates that the model is uncertain about
assigning probabilities to this particular class, which may be the result of an overlap in feature space between this
class and other classes.</p>
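          <p>A plot of this kind can be obtained, for example, by drawing the empirical CDF of the probability the model assigns to each class over the test set. The sketch below is one possible way to produce such a figure and is not necessarily how Figure 2 was generated; for brevity it uses scikit-learn's GaussianNB as a stand-in for the authors' implementation, and assumes X_train, y_train and X_test from the earlier evaluation sketch.</p>
          <preformat>
import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB

model = GaussianNB().fit(X_train, y_train)
probabilities = model.predict_proba(X_test)   # one column per class

# Empirical CDF of the predicted probability for each class.
for idx, cls in enumerate(model.classes_):
    p = np.sort(probabilities[:, idx])
    cdf = np.arange(1, len(p) + 1) / len(p)
    plt.plot(p, cdf, label=cls)

plt.xlabel("predicted probability")
plt.ylabel("empirical CDF")
plt.legend()
plt.show()
          </preformat>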
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. Analysis</title>
        <p>Fig. 3 and Fig. 4 show the confusion matrices [8] for both classifiers. As can be seen, most classes are predicted as
expected. An interesting aspect is that both classifiers have a problem predicting “overcast”; when they do not
predict it correctly, their second choice is “clear”.</p>
        <p>As shown in Fig. 5, the k-nearest neighbors classifier has an accuracy about 8 percentage points higher than the Gaussian
Naive Bayes classifier. This difference can be attributed to several factors. First, KNN is a
nonparametric algorithm, meaning that it does not assume any particular distribution of the data. This flexibility
allows it to effectively capture complex, nonlinear relationships in the feature space. GNB, on the other hand,
assumes that the features have a Gaussian distribution and are independent given the class label. When the actual
data distribution deviates from these assumptions, GNB's performance can suffer. Second, KNN relies on the
proximity of data points in the feature space, adapting well to different data distributions without making
strong assumptions. In addition, KNN can mitigate the impact of outliers and noisy data by considering
multiple nearest neighbors, which helps smooth out the influence of anomalous data points. GNB, on the other
hand, can become inaccurate under the significant influence of outliers, as they can distort the estimation of the
mean and variance parameters of the Gaussian distribution for each feature. Together, these factors
contribute to the higher accuracy we observed for KNN in our tests.</p>
        <p>The time comparison in Fig. 6 shows that the classifiers have completely different execution times. The KNN
algorithm can be more time-consuming, especially for large datasets, due to the need to calculate the distance
between each test point and every point in the training set. Naive Bayes, on the other hand, being based on a simple
probabilistic model, often exhibits lower computational complexity. In addition, differences in running times may
also be due to differences in the implementations of these algorithms and the characteristics of the specific data,
such as the number of dimensions or the size of the dataset.</p>
        <sec id="sec-3-2-1">
          <title>3.3.1. F1-Score</title>
          <p>The F1-score, or F1-measure, is a measure of predictive performance. It is calculated from the precision and recall
of the test, where the precision is the number of true positive results divided by the number of all samples predicted to
be positive, including those not identified correctly, and the recall is the number of true positive results divided
by the number of all samples that should have been identified as positive. Precision is also known as positive
predictive value, and recall is also known as sensitivity in diagnostic binary classification. Using the built-in F1
metric from the sklearn library [<xref ref-type="bibr" rid="ref9">9</xref>], nine iterations were conducted with data shuffling to calculate the results. This
approach ensured robustness in evaluating the model's performance across multiple trials and varying data
distributions. Each iteration involved computing the F1-score, which provides a balanced measure of the classifier's
precision and recall, thus capturing its ability to correctly classify positive instances while minimizing false
positives and false negatives. The iterative process allowed for a comprehensive assessment of the model's
effectiveness in handling different data configurations and revealed insights into its consistency and reliability.</p>
          <p>As seen in Fig. 7, the results obtained with the F1-score from the sklearn library closely align with the
results obtained using the accuracy calculation algorithm implemented by the authors in the tested classifier.</p>
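          <p>A sketch of this evaluation loop is shown below. It uses sklearn.metrics.f1_score [9] with a reshuffled split on each of the nine iterations; for brevity, scikit-learn's GaussianNB stands in for the authors' implementation, and X and y are the prepared features and labels from the earlier sketch. The exact shuffling procedure used by the authors may differ.</p>
          <preformat>
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

# Nine iterations, each with a different shuffle of the data.
for i in range(9):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, shuffle=True, random_state=i)
    y_pred = GaussianNB().fit(X_tr, y_tr).predict(X_te)
    print(f"run {i + 1}: F1 = {f1_score(y_te, y_pred, average='macro'):.3f}")
          </preformat>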
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>To sum up, our study using the London weather dataset provided valuable insights into the
functioning and performance of K-Nearest Neighbors (KNN) and the Gaussian Naive Bayes (GNB) classifier.
KNN is much easier to implement. Its concept is straightforward: it classifies new data points based on the
most common class among the nearest neighbors. This simplicity of implementation makes KNN an attractive option
for quick and easy classification tasks. However, it has its limitations. KNN can be slower to classify,
especially for large datasets, because it is necessary to calculate the distance between a new point and each
point in the training set. This distance calculation can become computationally expensive as the size of the
dataset increases, leading to longer classification times. On the other hand, the Bayes classifier, particularly the
Naive Bayes classifier, may require more effort at the implementation stage. This is due to the need to
calculate and model conditional probabilities and to make feature independence assumptions. Despite this
initial complexity, the Naive Bayes classifier can be faster during the classification phase, since it only requires
calculating the conditional probabilities for each feature and applying Bayes' rule. Throughout this study, we
gained substantial knowledge. Implementing these algorithms and benchmarking their performance
allowed us to gain practical experience with both KNN and GNB. We discovered firsthand the trade-offs
between ease of implementation and computational efficiency.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Marcin</given-names>
            <surname>Woźniak</surname>
          </string-name>
          , Michał Wieczorek, and Jakub Siłka. “
          <article-title>BiLSTM deep neural network model for imbalanced medical data of IoT systems”</article-title>
          .
          <source>In: Future Generation Computer Systems</source>
          <volume>141</volume>
          (
          <year>2023</year>
          ), pp.
          <fpage>489</fpage>
          -
          <lpage>499</lpage>
          . URL: https://www.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>sciencedirect.com/science/article/pii/S0167739X22004095.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Michał</given-names>
            <surname>Wieczorek</surname>
          </string-name>
          , Jakub Siłka, and Marcin Woźniak. “
          <article-title>Neural network powered COVID-19 spread forecasting model”</article-title>
          .
          <source>In: Chaos, Solitons &amp; Fractals</source>
          <volume>140</volume>
          (
          <year>2020</year>
          ), p.
          <fpage>110203</fpage>
          . Issn:
          <fpage>0960</fpage>
          -
          <lpage>0779</lpage>
          . URL: https : / / www.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>sciencedirect.com/science/article/pii/S0960077920305993.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Rizwana</given-names>
            <surname>Yasmeen. “K-Nearest Neighbor</surname>
          </string-name>
          (
          <article-title>KNN) Algorithm in Machine Learning”</article-title>
          . In: (
          <year>2023</year>
          ). URL: https: //medium.com/@rizwanayasmeen06/
          <article-title>k-nearest-neighbor-knn-algorithm-in-machine-learning-d38d9638d7e0.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>[9] scikit-learn. “sklearn.metrics.f1_score”</article-title>
          . In: (
          <year>2024</year>
          ). URL: https://scikit-learn.org/stable/modules/generated/sklearn. metrics.f1_score.html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>