<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text classification algorithms*</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Katarzyna Czernik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karolina Kamela</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
<addr-line>Kaszubska 23, 44-100 Gliwice</addr-line>
          ,
          <country country="PL">POLAND</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IVUS2024: Information Society and University Studies 2024</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>In this article, we describe and compare the operation of two classifiers: K-Nearest Neighbors and the Naive Bayes Classifier. We focus primarily on the application of these algorithms in text analysis. The texts are divided into three classes of abstraction: SPORTS, FOOD &amp; DRINK, and HOME &amp; LIVING, which correspond to the categories of the texts we selected. We evaluate the classifiers based on key metrics such as accuracy and execution time, providing a detailed analysis of their performance across different parameter settings and dataset sizes. The experimental setup involved multiple runs to ensure the robustness of the results, and the findings were averaged for consistency. Overall, this comparison provides valuable insights into the practical applications of the KNN and Naive Bayes Classifiers in text classification tasks, guiding the choice of algorithm based on specific needs such as accuracy, speed, and computational resources. For our study we used programs written in Python, using the libraries pandas, numpy, seaborn, matplotlib.pyplot, and sklearn. The average accuracy is 99.0222% for KNN and 91.3333% for the Naive Bayes classifier. The advantage in accuracy lies with KNN; however, the operational time required to achieve such a result amounts to as much as 173.8866 s, whereas the Bayesian classifier is capable of analyzing a dataset of the same size in an average of 0.2897 s.</p>
      </abstract>
      <kwd-group>
        <kwd>KNN</kwd>
        <kwd>Naive Bayes Classifier</kwd>
        <kwd>Text analysis</kwd>
        <kwd>Comparison</kwd>
        <kwd>Machine Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Text classification is a crucial task in natural language processing (NLP), enabling the automated
categorization of text into predefined categories. In this study, we explore and compare the
effectiveness of two popular classifiers: the Naïve Bayes Classifier and K-Nearest Neighbors (KNN),
within the context of text analysis.</p>
      <p>
        In text classification, KNN uses the entire training data set, which makes the process of measuring
and sorting more complex and time-consuming. It often shows different results for
different samples [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]: "KNN still suffers from inductive biases or model misfits that result
from its assumptions, such as the presumption that training data are evenly distributed among all
categories". The K-Nearest Neighbors algorithm has been developed by adding and modifying
various improvement schemes [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        The second method we will use is the Naive Bayes Classifier, which is most often used as a baseline
in text classification because it is fast and easy to implement [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Some may say that the Naive
Bayes classifier is currently experiencing a renaissance in machine learning [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: in numerous
head-to-head classification papers [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ][
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] it has been earning nearly last or even last places.
      </p>
      <p>
        Existing solutions offer different trade-offs in terms of accuracy, computational complexity,
and scalability. KNN is known for its simplicity and effectiveness in various domains, while
Naive Bayes is appreciated for its strong probabilistic foundation and efficiency. By comparing
these classifiers, we aim to provide insights into their relative strengths and weaknesses,
particularly in handling text data. There have been previous attempts, similar to ours, to compare these
two classifiers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Based on the obtained data, we can analyze both algorithms' advantages and
disadvantages, which could result in new and innovative ideas for potential applications of the
classifiers.
      </p>
      <p>
        The authors of the article [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] had the idea to construct a new classifier that combines the
distance-based K-Nearest Neighbors algorithm and the statistics-based Naïve Bayes Classifier in
order to increase effectiveness and accuracy. As is noted in that article [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] both algorithms
have their weaknesses. In the case of the K-Nearest Neighbors algorithm the issue is caused by
problems regarding categorical attributes, whereas the Naïve Bayes Classifier has issues handling
numerical attributes.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology of KNN</title>
      <p>KNN (k-Nearest Neighbors) identifies the k neighbors to which the examined element is closest. We
can use different metrics to measure the similarity between data points.</p>
      <p>In the project we used the Euclidean metric. For two points $\mathbf{p} = (p_1, p_2, \ldots, p_n)$ and $\mathbf{q} = (q_1, q_2, \ldots, q_n)$, the Euclidean distance $d(\mathbf{p}, \mathbf{q})$ is given by the formula:
$d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
When analyzing an identical dataset, different metrics may return different results [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Formulas for KNN's most popular metrics:
Minkowski metric:
$d(\mathbf{p}, \mathbf{q}) = \left( \sum_{i=1}^{n} |p_i - q_i|^m \right)^{1/m}$
where:
• $d(\mathbf{p}, \mathbf{q})$ is the distance between points p and q,
• $p_i$ and $q_i$ are the coordinates of the points in the i-th dimension,
• $m$ is the Minkowski metric parameter.</p>
      <p>Manhattan metric:
$d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} |p_i - q_i|$
where:
• $d(\mathbf{p}, \mathbf{q})$ is the distance between points p and q,
• $p_i$ and $q_i$ are the coordinates of the points in the i-th dimension.</p>
      <p>After measuring distances beetwen the element and all elements from training set, the
points are sorted based on their distance from the examined element. The k nearest elements
are selected. Then classifier assigns the class label to the element based on the majority class
among its k nearest neighbors.</p>
      <sec id="sec-2-1">
        <title>Calculation Example:</title>
        <p>Determine, for k=3, to which set the element * belongs among the sets of elements shown in
the chart:</p>
        <p>The example (Figure 1.) includes three classes of abstraction: ’green’, ’yellow’, and ’pink’.
Looking at the chart or calculating the distances from point * to the other elements, we can
determine that the k nearest elements to * are: ’pink’, ’yellow’, ’pink’.</p>
        <p>Then, through voting, the classifier determines which class of abstraction is most common
among the k nearest neighbors. In this case, it is ’pink’.</p>
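        <p>As a quick sketch of this majority vote (not the paper's code), Python's collections.Counter can select the most frequent label among the k neighbors:</p>
        <preformat>
from collections import Counter

neighbors = ["pink", "yellow", "pink"]  # labels of the k = 3 nearest elements
predicted = Counter(neighbors).most_common(1)[0][0]
print(predicted)  # "pink"
</preformat>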
        <sec id="sec-2-1-1">
          <title>Data: Training set  , test point  , number of neighbors</title>
        </sec>
        <sec id="sec-2-1-2">
          <title>Result: Predicted class for point</title>
          <p>1 for each point  in  do
2</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Calculate the distance (, );</title>
          <p>3 Sort points  by distance (, );</p>
        </sec>
        <sec id="sec-2-1-4">
          <title>4 Select the  nearest neighbors;</title>
          <p>5 Obtain the class labels of the neighbors;
6 Choose the most frequent class label as the predicted class;
7 return Predicted class;</p>
          <p>In the project we divided our database into a training set and a validation set. We used a 70:30
proportion, so for the base consisting of 1500 elements, the validation set has 450 elements and the training set
has 1050 elements. The algorithm predicts the class of an element taken from the validation set, then
it checks the accuracy of the guess. This action is repeated for every element in the validation set.</p>
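          <p>A minimal sketch of this setup using sklearn, one of the libraries listed in the abstract (the file name and the feature-column names are our assumptions, not the paper's actual code):</p>
          <preformat>
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Hypothetical file and column names for the per-category word counts;
# the paper's actual table layout may differ.
df = pd.read_csv("database.csv")
X = df[["SPORTS", "HOME_LIVING", "FOOD_DRINK"]]
y = df["category"]

# 70:30 split, as in the paper: 1050 training and 450 validation elements
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)

# sklearn's KNN uses the Euclidean (Minkowski, p = 2) metric by default
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
print(accuracy_score(y_val, knn.predict(X_val)))
</preformat>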
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology of Naive Bayes Classifier</title>
      <p>Description of Operation
The Bayesian classifier operates based on three key elements:
• Prior probability (initial probability): our initial belief about the chances
that a given text belongs to each category. In our case, these probabilities are calculated
based on the ratio of the number of words in the dictionary of each category to the total number of
words in all dictionaries. Since each dictionary has 150 words, this probability is the
same for all classes and equals 1/3.</p>
      <p>3
• Conditional probability of words in the class: This refers to the probability that a
given word will appear in a text that belongs to a specificcategory. This is calculated
based on the frequency of words in a given category. To avoid a scenario where a word
has a probability of 0, we use Laplace smoothing, which in this case means adding 1 to
the occurrence of each word in the text.
• Conditional probability of text in the class: This is what we actually want to find
out. It is the probability that a given text belongs to a specific category, based on the
words that appear in it. It is calculated based on the conditional probability of words in the
class, for all the words in the text.</p>
      <p>The conditional probability of a word $w$ for a class $c$ is calculated as:
$P(w \mid c) = \dfrac{count_{w,c} + 1}{N_c + |V|}$
where:
• $P(w \mid c)$ denotes the conditional probability of the word $w$ in the class $c$,
• $count_{w,c}$ is the number of occurrences of the word $w$ in the class $c$,
• $N_c$ is the total number of words in the class $c$,
• $|V|$ denotes the number of unique words in all classes.</p>
      <p>The conditional probability of a text $d$ for a class $c$ is calculated as:
$P(c \mid d) \propto P(c) \cdot \prod_{w \in d} P(w \mid c)$
where:
• $P(c)$ is the prior probability of the class $c$,
• $P(w \mid c)$ is the probability of the word $w$ occurring in the class $c$.</p>
      <p>Logarithmization allows for summing the logarithms of probabilities instead of multiplying the
probabilities themselves, which is numerically more stable:
$\log P(c) + \sum_{w \in d} \log P(w \mid c)$.</p>
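      <p>As a quick illustration (the numbers are illustrative, not from the study), multiplying many small probabilities underflows to zero in floating point, while summing their logarithms stays finite:</p>
      <preformat>
import math

probs = [1e-5] * 300                     # many small word probabilities
print(math.prod(probs))                  # 0.0 - the product underflows
print(sum(math.log(p) for p in probs))   # about -3453.88 - stays finite
</preformat>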
      <p>With Laplace smoothing, if a word $w$ does not appear in the variable storing the probabilities of
word occurrences for the class $c$, we use:
$P(w \mid c) = \dfrac{1}{\text{total\_count}}$
where:
• total_count is the sum of all word occurrences in that class plus 1 for each word
(smoothing), i.e. $N_c + |V|$,
• len(self.vocab) is the number of unique words in all dictionaries, i.e. $|V|$.
The text is finally assigned to the class with the highest score:
$\hat{c} = \arg\max_{c} \left( \log P(c) + \sum_{w \in x} \log P(w \mid c) \right)$
where:
• $c$ denotes the class to which we assign the text,</p>
      <p>• $x$ denotes the features of the text, i.e., the words in the text.</p>
      <p>Algorithm 2: Calculating conditional probabilities of words.</p>
      <p>Data: data dictionaries D
Result: dictionary of word probabilities word_probs
1 word_probs ← {};
2 foreach class c, words W in D do
3     counts_c ← defaultdict with default value 1;
4     foreach word w in W do
5         …
9 best_class ← class with the highest score in scores;
10 return best_class;</p>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>4.1. Database
The database was constructed based on the News Category Dataset available on the Kaggle
platform. From the original database, the following were utilized: headline category and
headline + short description, combined into one column of the table named "Text".
Then we created 150-word-long dictionaries consisting of the most commonly used vocabulary in the
texts, divided by categories: SPORTS, HOME &amp; LIVING, and FOOD &amp; DRINK.
Using a Python program, the number of words occurring in the text matching the selected
categories was calculated. Subsequently, appropriate columns describing these numbers were
created (Table 1).
The final database consists of 1500 records.</p>
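      <p>A hedged sketch of this preprocessing step follows; the file name, the sample dictionary words, and the column names are placeholders, not those of the actual study:</p>
      <preformat>
import pandas as pd

# The Kaggle dataset is distributed as JSON lines with, among others,
# "headline" and "short_description" fields (file name assumed here).
df = pd.read_json("News_Category_Dataset_v3.json", lines=True)
df["Text"] = df["headline"] + " " + df["short_description"]

# 150-word dictionaries per category (three sample words shown here)
dictionaries = {
    "SPORTS": {"game", "team", "season"},
    "HOME_LIVING": {"home", "decor", "diy"},
    "FOOD_DRINK": {"recipe", "wine", "dinner"},
}

# For each category, count how many words of the text appear in that
# category's dictionary and store the count as a new column
for cat, vocab in dictionaries.items():
    df[cat] = df["Text"].apply(
        lambda t: sum(w in vocab for w in str(t).lower().split())
    )
</preformat>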
      <p>Here are some of the records.
4.2. Tests</p>
      <sec id="sec-4-1">
        <title>Test for KNN:</title>
        <p>
          The charts (Figure 2) depict the relationships between the characteristic features of the individual
abstraction classes and their intensity for elements from the other classes. The
dispersion of the elements is presented, with colors corresponding to the text
categories analyzed in this project: 'green' - 'SPORTS', 'yellow' - 'HOME &amp; LIVING', and
'pink' - 'FOOD &amp; DRINK'.
Five tests of the program's accuracy were conducted (Figure 3) for values of k from 2 to 8.
The results were then averaged (Figure 4).
Results:
Average accuracy for KNN = 99.0222%
Average time needed to analyze the whole base for a chosen k = 173.8866626 s
Conclusion:
Depending on the content of the training set, the k-nearest neighbors algorithm demonstrates
variable accuracy in text analysis [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. This stems from the uneven distribution of elements
across the abstraction classes in the training set, as well as the varying intensity of the training
attributes among the elements. Therefore, even when operating on the same database, each
new partition of the elements into training sets yields slightly different results (Figure 3).
The tests were conducted on the entire database of 1500 records. The ratio of the
training set to the validation set is 70:30.
        </p>
        <p>Test for Naive Bayes Classifier:
From a database size of 500 rows, the accuracy slowly normalized until, with 1500 rows, it reached a level of just over
93%. Average accuracy for Naive Bayes Classifier: 93%. Average time needed to analyze the database:
0.15 s.
4.3. Experiments with KNN</p>
        <p>Experiment 1:
Examine the accuracy and operation time of the program depending on k.
The accuracy decreases for too large a number
of neighbors, which leads to a loss of the model's ability to capture local patterns in the data.
4.4. Comparing KNN with Naive Bayes Classifier
Experiment 1:
Conclusions of Figure 9:
It is noticeable that the algorithm correctly classifies almost all of the texts. Out of 450 texts, it
misclassified only 4.</p>
        <p>Conclusion:
The most problematic category for both classifiers is HOME &amp; LIVING. KNN's accuracy is
overall slightly better than Bayes's.</p>
        <p>Experiment 2:
Examine the accuracy and operation time of the program depending on the number of analyzed
elements:
The entire database is taken, then it is randomly shuffled. The validation and training sets are
created from the first n elements of the shuffled database. In KNN, k = 10.
Conclusions of Figure 11:
The accuracy for a database size from 100 to 300 elements is lower than for
larger n in the case of both classifiers. The functions shown in the chart are not monotonic; their
values, regardless of the database size, slightly increase or decrease. However, for larger values of n,
the accuracy begins to stabilize.
The time required to perform the KNN operations increases with the size of the database; using
Lagrange interpolation, a formula for this dependence can be determined.
The time needed to run the Naive Bayes Classifier also increases, but compared to KNN the scale of
the increase is negligible: from 0.02 s to 0.29 s.</p>
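        <p>To illustrate how such time measurements can be collected (a sketch with synthetic stand-in features, not the study's actual harness; sklearn's MultinomialNB stands in for the paper's own Bayes implementation):</p>
        <preformat>
import time
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

def time_classifier(clf, X, y, n):
    """Fit and predict on the first n shuffled elements; return seconds."""
    X_tr, X_val, y_tr, y_val = train_test_split(X[:n], y[:n], test_size=0.3)
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)
    clf.predict(X_val)
    return time.perf_counter() - start

rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(1500, 3))  # stand-in word-count features
y = rng.integers(0, 3, size=1500)        # stand-in class labels

for n in range(100, 1501, 200):
    t_knn = time_classifier(KNeighborsClassifier(n_neighbors=10), X, y, n)
    t_nb = time_classifier(MultinomialNB(), X, y, n)
    print(n, round(t_knn, 4), round(t_nb, 4))
</preformat>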
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In conclusion, our comparative study of K-Nearest Neighbors (KNN) and Naive Bayes Classifier for
text classification reveals distinct advantages and limitations of each method. KNN
demonstrated superior accuracy, achieving an average accuracy of 99.02%, which significantly
outperformed the Naive Bayes Classifier’s 91.33%. However, this accuracy came at a cost, with KNN
requiring considerably more computational time (173.89 seconds on average) compared to the much
faster Naive Bayes (0.29 seconds on average).</p>
      <p>Overall, while KNN provides higher accuracy, its computational demands make it less suitable for
large-scale applications compared to Naive Bayes, which offers a good balance of speed and
accuracy. For applications where speed is critical and slight accuracy trade-offs are acceptable,
Naive Bayes is preferable. Conversely, for scenarios demanding the highest possible accuracy
and where computational resources are ample, KNN is the better choice. This study
underscores the importance of selecting the appropriate classifier based on the specific
requirements and constraints of the application at hand.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ferdousy</surname>
            , Elma &amp; Islam, Md &amp; Matin,
            <given-names>M.</given-names>
          </string-name>
          . (
          <year>2013</year>
          ).
          <article-title>Combination of Naïve Bayes Classifier and K-Nearest Neighbor (cNK) in the Classification Based Predictive Models</article-title>
          .
          <source>Computer and Information Science</source>
          .
          <volume>6</volume>
          . 10.5539/cis.v6n3p48.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Muliono</surname>
            , Yohan &amp; Tanzil,
            <given-names>Fidelson.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>A Comparison of Text Classification Methods kNN, Naïve Bayes, and Support Vector Machine for News Classification</article-title>
          .
          <source>Jurnal Informatika: Jurnal Pengembangan IT</source>
          .
          <volume>3</volume>
          .
          <fpage>157</fpage>
          -
          <lpage>160</lpage>
          . 10.30591/jpit.v3i2.828.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>Songbo.</given-names>
          </string-name>
          (
          <year>2006</year>
          ).
          <article-title>An effective refinement strategy for KNN text classifier</article-title>
          .
          <source>Expert Systems with Applications</source>
          .
          <volume>30</volume>
          .
          <fpage>290</fpage>
          -
          <lpage>298</lpage>
          . 10.1016/j.eswa.2005.07.019.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>J. D. M. Rennie</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Shih</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Teevan</surname>
            , and
            <given-names>D. R.</given-names>
          </string-name>
          <string-name>
            <surname>Karger</surname>
          </string-name>
          , “
          <article-title>Tackling the Poor Assumptions of Naive Bayes Text Classifiers</article-title>
          ,”
          <source>Proceedings of ICML '03</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lewis</surname>
            ,
            <given-names>D. D.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>Naive (Bayes) at forty: The independence assumption in information retrieval</article-title>
          .
          <source>Proceedings of ECML '98.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>A re-examination of text categorization methods</article-title>
          .
          <source>Proceedings of SIGIR '99</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Joachims</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>Text categorization with support vector machines: Learning with many relevant features</article-title>
          .
          <source>Proceedings of ECML '98.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Oles</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>Text categorization based on regularized linear classification methods</article-title>
          .
          <source>Information Retrieval</source>
          ,
          <volume>4</volume>
          ,
          <fpage>5</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sun</surname>
            , Jingwen &amp; Du, Weixing &amp; Shi,
            <given-names>Niancai.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>A Survey of kNN Algorithm</article-title>
          .
          <source>Information Engineering and Applied Computing</source>
          .
          <volume>1</volume>
          . 10.18063/ieac.v1i1.770.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Jayaprakash, Sreemathy &amp; Balamurugan, P. (2022). An Efficient Text Classification Using KNN and Naive Bayesian. International Journal on Computer Science and Engineering. 4. 392-396.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Chomboon, Kittipong &amp; Chujai, Pasapitch &amp; Teerarassammee, Pongsakorn &amp; Kerdprasop, Kittisak &amp; Kerdprasop, Nittaya. (2015). An Empirical Study of Distance Metrics for k-Nearest Neighbor Algorithm. 280-285. 10.12792/iciae2015.051.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>