<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Text classification algorithms *</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Katarzyna</forename><surname>Czernik</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Mathematics</orgName>
								<orgName type="institution">Silesian University of Technology</orgName>
								<address>
									<addrLine>Kaszubska 23</addrLine>
									<postCode>44100</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">POLAND</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Karolina</forename><surname>Kamela</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Mathematics</orgName>
								<orgName type="institution">Silesian University of Technology</orgName>
								<address>
									<addrLine>Kaszubska 23</addrLine>
									<postCode>44100</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">POLAND</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Information Society</orgName>
								<orgName type="institution">University Studies</orgName>
								<address>
									<addrLine>2024, May 17</addrLine>
									<settlement>Kaunas</settlement>
									<country key="LT">Lithuania</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Text classification algorithms *</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">BC8042F8574A0A07365D24634CA59A33</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>KNN</term>
					<term>Naive Bayes Classifier</term>
					<term>Text analysis</term>
					<term>Comparison</term>
					<term>Machine Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In the article, we will describe and compare the operation of two classifiers: K-Nearest Neighbors and Naive Bayes Classifier. We will focus primarily on the application of these algorithms in text analysis. The division of texts is made into three classes of abstraction: SPORTS, FOOD &amp; DRINK, and HOME &amp; LIVING, which correspond to the categories of the texts we selected. We evaluate the classifiers based on key metrics such as accuracy and execution time, providing a detailed analysis of their performance across different parameter settings and dataset sizes. The experimental setup involved multiple runs to ensure the robustness of the results, and the findings were averaged for consistency. Overall, this comparison provides valuable insights into the practical applications of KNN and Naive Bayes Classifiers in text classification tasks, guiding the choice of algorithm based on specific needs such as accuracy, speed, and computational resources. For our study we used programs written in Python, using libriares: pandas, numby, seaborn, matplotlib.pyplot and sklearn Average results of accuracy is 99.0222% for KNN and 91.3333% for Naive Bayes classifier. The advantage in accuracy lies with KNN; however, the operational time required to achieve such a result amounts to as much as 173.8866 s, whereas the Bayesian classifier is capable of analyzing a dataset of the same size in an average of 0.2897 s.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Text classification is a crucial task in natural language processing (NLP), enabling the automated categorization of text into predefined categories. In this study, we explore and compare the effectiveness of two popular classifiers: Naïve Bayes Classifier and K-Nearest Neighbor(KNN), within the context of text analysis. In text classification KNN uses all the training data sets which makes process of measuring and sorting beacome more complex and time-consuming. It often shows different results from different samples <ref type="bibr" target="#b2">[3]</ref>: "KNN still suffers from inductive biases or model misfits that result from its assumptions, such as the presumption that training data are evenly distributed among all categories". K-Nearest Neighbor algorithm has been developed by adding and modyfying various improvement schemes <ref type="bibr" target="#b8">[9]</ref>. The second method we will use is Naive Bayes Classifier which is most often used as a baseline intext classification because it is fast and easy to implement <ref type="bibr" target="#b3">[4]</ref>. Some may say that the Naive Bayes classifier is currently experiencing a renaissance in machine learning <ref type="bibr" target="#b4">[5]</ref>: in numerous head-to-head classification papers <ref type="bibr" target="#b5">[6]</ref> [7] <ref type="bibr" target="#b7">[8]</ref> it has been earning nearly las tor even last places.</p><p>Existing solutions offer different trade-offs in terms of accuracy, computational complexity, and scalability. KNN is known for its simplicity and effectiveness in various domains, while Naive Bayes is appreciated for its strong probabilistic foundation and efficiency. By comparing these classifiers, we aim to provide insights into their relative strengths and weaknesses, particularly in handling text data. There were some attempts to, similarly to us, compare those two classifiers <ref type="bibr" target="#b1">[2]</ref>  <ref type="bibr" target="#b9">[10]</ref>. Based on obtained data we can analyze both algorythms advantages and disadvantages, which could result with new and innovaitve ideas of potencial application of classifiers. Authors of that article <ref type="bibr" target="#b0">[1]</ref> had the idea to construct a new classifier that combines the distance-based algorithm K-Nearest Neighbor and statistical based Naïve Bayes Classifier in order to increase effectivnes and accuracy . As it is noticed in this article <ref type="bibr" target="#b0">[1]</ref> both alghorithms have their weeknesses. In case of K Nearest Neighbor algorithm the issues is caused by problems regarding categorical attributes, whereas Naïve Bayes Classifier have issue handling numerical atributes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>∑︁</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology of KNN</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Description of Operation</head><p>KNN (k-Nearest Neighbors) identifies k neighbors to which the examined element is closest to. We can use different metrics to measure the similarity between data points.</p><p>In the project we used the Euclidean metric. For two points p = (𝑝 1 , 𝑝 2 , . . . , 𝑝 𝑛 ) and q = (𝑞 1 , 𝑞 2 , . . . , 𝑞 𝑛 ), the Euclidean distance 𝑑(p, q) is given by the formula: W wyniku analizy identycznego zbioru danych różne metryki mogą zwracać różne rezultaty <ref type="bibr" target="#b10">[11]</ref>. Formulas for KNN's most popular metrics: Minkowski Metric: where:</p><p>• 𝑑(p, q) is the distance between points p and q, • 𝑝 𝑖 and 𝑞 𝑖 are the coordinates of points in the i-th dimension • 𝑟 is the Minkowski metric parameter. Manhattan Metric:</p><p>where:</p><formula xml:id="formula_0">𝑛 𝑑(p, q) = |𝑝 𝑖 − 𝑞 𝑖 | 𝑖=1</formula><p>• 𝑑(p, q) is the distance between points p and q,</p><p>• 𝑝 𝑖 and 𝑞 𝑖 are the coordinates of points in the 𝑖-th dimention.</p><p>After measuring distances beetwen the element and all elements from training set, the points are sorted based on their distance from the examined element. The k nearest elements are selected. Then classifier assigns the class label to the element based on the majority class among its k nearest neighbors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Calculation Example:</head><p>Determine, for k=3, to which set the element * belongs among the sets of elements shown in the chart:</p><p>The example (Figure <ref type="figure" target="#fig_0">1</ref>.) includes three classes of abstraction: 'green', 'yellow', and 'pink'. Looking at the chart or calculating the distances from point * to the other elements, we can determine that the k nearest elements to * are: 'pink', 'yellow', 'pink'. Then, through voting, the classifier determines which class of abstraction is most common among the k nearest neighbors. In this case, it is 'pink'. In the project we divided our database into training set and validation set. We used 70 : 30 proportion, so for the base consisting 1500 elements, validation set has 450 elements and training set has 1050 elements. Alghorithm predicts class of the element taken from validation set, when it checks accuracy of the guess. This action is reapeated for every element in validation set.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methotology of Naive Bayes Classifier Description of Operation</head><p>The Bayesian classifier operates based on three key elements:</p><p>• Prior probability (initial probability): This is our initial belief about the chances that a given text belongs to each category. In our case, these probabilities are calculated based on the ratio of the number of words in the dictionary of each category to the total number of words in all dictionaries. Since each dictionary has 150 words, this probability will be the same for all classes and will be 1 . • Conditional probability of words in the class: This refers to the probability that a given word will appear in a text that belongs to a specific category. This is calculated based on the frequency of words in a given category. To avoid a scenario where a word has a probability of 0, we use Laplace smoothing, which in this case means adding 1 to the occurrence of each word in the text. • Conditional probability of text in the class: This is what we actually want to find out. It is the probability that a given text belongs to a specific category, based on the words that appear in it. It is calculated based on the conditional probability of words in the class, for all the words in the text.</p><p>The conditional probability of a word 𝑤 for a class 𝐶 is calculated as:</p><p>where:</p><p>• 𝑃 (𝑤|𝐶) -denotes the conditional probability of the word 𝑤 in the class 𝐶 • 𝑁 , 𝑤 𝐶 -is the number of occurrences of the word 𝑤 in the class 𝐶 • 𝑁 𝐶 -is the total number of words in the class 𝐶 • 𝑉 -denotes the number of unique words in all classes</p><p>The conditional probability of a text 𝑇 for a class 𝐶 is calculated as: where:</p><p>• 𝑃 (𝐶) is the prior probability of the class 𝐶.</p><p>• 𝑃 (𝑤|𝐶) is the probability of the word 𝑤 occurring in the class 𝐶.</p><p>Logarithmization allows for summing the logarithms of probabilities instead of multiplying the probabilities themselves, which is numerically more stable.</p><p>With Laplace smoothing, if a word 𝑤 does not appear in the variable storing the probabilities of word occurrences for the class 𝐶, we use: where:</p><p>• total_count is the sum of all word occurrences in that class plus 1 for each word (smoothing). • len(self.vocab) is the number of unique words in all dictionaries.</p><formula xml:id="formula_1">𝑃 (𝑊 ) 𝑡 𝑜𝑡 𝑎𝑙 _𝑐𝑜𝑢𝑛𝑡 • v is a vector • L(v)</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>is a length of the vector</head><p>The conditional probability</p><p>• 𝑃 (𝐶 ∧ 𝑊 ) -the joint probability of 𝐶 and 𝑊 occurring.</p><p>• 𝑃 (𝐶 | 𝑊 ) = 𝑃 ( 𝐶 ∧ 𝑊 ) -conditional probability: if 𝑊 has occurred, then the probability that 𝐶 has also occurred is 𝑃 ( 𝐶 | 𝑊 ).</p><p>where:</p><p>• 𝐶 denotes the class to which we assign the text.</p><p>• 𝑊 denotes the features of the text, i.e., the words in the text. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithms</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Database</head><p>The database was constructed based on the News Category Dataset available on the Kaggle platform. From the original database, the following were utilized: headline category and headline + short description -combined into one column of the table named "Text". Then we created 150 words long dictonaries consisting most commonly used vocabulary in texts, divided by categories: SPORTS, HOME&amp;LIVING, and FOOD&amp;DRINK. Using a Python program, the number of words occurring in the text matching the selected categories was calculated. Subsequently, appropriate columns describing these numbers were created (Table <ref type="table">1</ref>) Final database is made out of 1500 records.</p><p>Here are some of the records.</p><p>No. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Tests</head><p>Test for KNN:</p><p>The charts(         </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Comparing KNN with Naive Bayes Classifier</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion:</head><p>Most problematic category for both classifiers is category HOME&amp;LIVING. KNN's accuracy is overall a little better than Bayes's.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiment 2:</head><p>Examine the accuracy and operation time of the program depending on the number of analyzed elements:</p><p>The entire database is taken, then it is randomly shuffied. The validation and training sets are created from the first n elements of the shuffied database. In KNN k=10.  The time required to perform KNN operations increases with the size of the database. Using Lagrange interpolation, the formula can be determined: </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>In conclusion, our comparative study of K-Nearest Neighbors (KNN) and Naive Bayes Classifier for text classification reveals distinct advantages and limitations of each method. KNN demonstrated superior accuracy, achieving an average accuracy of 99.02%, which significantly outperformed the Naive Bayes Classifier's 91.33%. However, this accuracy came at a cost, with KNN requiring considerably more computational time (173.89 seconds on average) compared to the much faster Naive Bayes (0.29 seconds on average). Overall, while KNN provides higher accuracy, its computational demands make it less suitable for large-scale applications compared to Naive Bayes, which offers a good balance of speed and accuracy. For applications where speed is critical and slight accuracy trade-offs are acceptable, Naive Bayes is preferable. Conversely, for scenarios demanding the highest possible accuracy and where computational resources are ample, KNN is the better choice. This study underscores the importance of selecting the appropriate classifier based on the specific requirements and constraints of the application at hand.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :3</head><label>1</label><figDesc>Figure 1: Ilustration of example</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Algorithm 2 : 6 𝑡𝑜𝑡𝑎𝑙_𝑐𝑜𝑢𝑛𝑡 3 : 5 foreach word 𝑤𝑜𝑟𝑑 in 𝑤𝑜𝑟𝑑𝑠 do 6 if 𝑤𝑜𝑟𝑑 is in 𝑣𝑜𝑐𝑎𝑏 then 7 𝑐𝑙𝑎𝑠𝑠_𝑠𝑐𝑜𝑟𝑒𝑠 8 1</head><label>2635678</label><figDesc>Calculating conditional probabilities of words. Data: Data dictionaries: 𝑑𝑖 𝑐𝑡𝑖 𝑜𝑛𝑎𝑟𝑖 𝑒𝑠 Result: Dictionary of word probabilities: 𝑤𝑜𝑟𝑑_𝑝𝑟𝑜𝑏𝑠 1 𝑤𝑜𝑟𝑑_𝑝𝑟𝑜𝑏𝑠 ← {}; 2 foreach class 𝑐𝑙𝑠, words 𝑤𝑜𝑟𝑑𝑠 in 𝑑𝑖𝑐𝑡𝑖𝑜𝑛𝑎𝑟𝑖𝑒𝑠 do 3 𝑤𝑜𝑟𝑑_𝑐𝑜𝑢𝑛𝑡𝑠 ← defaultdict with default value 1; 4 foreach word 𝑤𝑜𝑟𝑑 in 𝑤𝑜𝑟𝑑𝑠 do 5 𝑤𝑜𝑟𝑑_𝑐𝑜𝑢𝑛𝑡𝑠[𝑤𝑜𝑟𝑑] += 1; ← sum of values in 𝑤𝑜𝑟𝑑_𝑐𝑜𝑢𝑛𝑡𝑠; 7 𝑤𝑜𝑟𝑑_𝑝𝑟𝑜𝑏𝑠[𝑐𝑙 𝑠] ← {word 𝑤𝑜𝑟𝑑 : 𝑐 𝑜𝑢𝑛 𝑡 for 𝑤𝑜𝑟 , 𝑑 𝑐𝑜𝑢𝑛𝑡 in 𝑤𝑜𝑟𝑑_𝑐𝑜𝑢𝑛𝑡𝑠}; 8 return 𝑤𝑜𝑟𝑑_𝑝𝑟𝑜𝑏𝑠 Algorithm Predicting the class of a text. Data: Text: 𝑡𝑒𝑥𝑡 Result: Predicted class: 𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑_𝑐𝑙 𝑎𝑠𝑠 1 𝑡𝑒𝑥𝑡 ← 𝑡𝑒𝑥𝑡 replace all punctuation with spaces, remove quotes, hyphens, exclamation marks, question marks, and apostrophes; 2 𝑤𝑜𝑟𝑑𝑠 ← split 𝑡𝑒𝑥𝑡 into words, convert to lowercase; 3 𝑐𝑙𝑎𝑠𝑠_ 𝑠𝑐𝑜𝑟𝑒𝑠 ← dictionary with initial values equal to the logarithm of pri or probabilities from 𝑐𝑙𝑎𝑠𝑠_𝑝𝑟𝑜𝑏𝑠; 4 foreach class 𝑐𝑙𝑠, word probabilities 𝑤𝑜𝑟𝑑_𝑝𝑟𝑜𝑏 in 𝑤𝑜𝑟𝑑_𝑝𝑟𝑜𝑏𝑠 do sum of values in 𝑤𝑜𝑟𝑑_𝑝𝑟𝑜𝑏+|𝑣𝑜𝑐𝑎𝑏|)) ;</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 .</head><label>2</label><figDesc>) depict the relationships between the characteristic features of individual abstraction classes and their intensity in the case of elements from other classes. The dispersion of elements has been presented, with colors corresponding to the following classes -text categories analyzed in this project: 'green' -'SPORTS', 'yellow' -'HOME &amp; LIVING', and 'pink' -'FOOD &amp; DRINK'.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Graphic reprezentation of database</figDesc><graphic coords="8,193.45,122.50,203.35,102.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 3 :Figure 4 :</head><label>34</label><figDesc>Figure 3: Accuracy for k in {2,3,4,5,6,7,8}</figDesc><graphic coords="8,192.10,450.30,204.20,120.45" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Time depending of number of records Conclusions of Figure 5.: Figure 5, which shows the classification execution time depending on the number of data rows, shows the expected dependence of the time on the number of rows. The more rows the algorithm tells you to classify, the longer it takes it. Measurements were performed every 100 added rows. The shortest time was for 100 and 200 records, and the longest time was for all of 1500 records of database.</figDesc><graphic coords="9,195.90,193.30,201.15,149.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 : 4 . 3 . 1 :</head><label>6431</label><figDesc>Figure 6: Accuracy depending of number of records Conclusions of Figure 6.: Figure 4 shows an interesting relationship. The lowest accuracy was recorded for data consisting of 100 rows -89%, while the highest for 400 rows -almost 95%. From data consisting</figDesc><graphic coords="9,192.10,446.60,208.35,164.40" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Accuracy and operation time for k in {10,20,30,...,250} Conclusions of Figure 7.: As the value of k increases, the accuracy of the program decreases.The trend line can be described by the formula: 1 𝑥 + 99. The decrease in accuracy for k belonging to {10,20,30...,250} is slight -about 3%. The time required to perform the classification for most k values oscillates between 170 s and 175 s. The standard deviation is: 2.065 s Thus, it can be stated that the time required to execute the test(k) function is independent of the value of k.</figDesc><graphic coords="10,192.10,187.45,206.90,148.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Accuracy for k in {100,200,...,1000} Conclusions of Figure 8.: Considering larger k values, a sharp drop in classifier accuracy can be observed for k in the range (700,800). When k is too large, the algorithm begins to average predictions based on a very large number</figDesc><graphic coords="10,199.35,467.75,194.75,105.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 9 :Figure 10 :</head><label>910</label><figDesc>Figure 9: Confusion matrix for Naive Bayes Classifier Conclusions of Figure 9.: It is noticeable that the algorithm correctly classifies texts in the vast majority of cases. Most often, it confuses the FOOD &amp; DRINK and SPORTS categories with the HOME &amp; LIVING category. This may be due to the dictionaries being insufficiently long or the data being too</figDesc><graphic coords="11,224.65,398.70,191.60,164.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_11"><head>Figure 11 :</head><label>11</label><figDesc>Figure 11: Accuracy for n(size of dataset) in {100,200,...,1500}</figDesc><graphic coords="12,193.45,148.65,208.35,77.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_12"><head>Figure 12 :</head><label>12</label><figDesc>Figure 12: KNN operation time for n(size of dataset) in {100,200,...,1500}</figDesc><graphic coords="12,193.45,342.20,206.35,85.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_13"><head>Figure 13 :</head><label>13</label><figDesc>Figure 13: Green line -chart of dependency of time on n Blue line -polynomial obtained by interpolation</figDesc><graphic coords="12,193.45,513.25,206.80,120.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 2</head><label>2</label><figDesc>Graphic representation of our database -second part</figDesc><table><row><cell></cell><cell>Text</cell><cell></cell><cell cols="2">Category</cell><cell>Words</cell></row><row><cell>1</cell><cell>maury wills...</cell><cell></cell><cell cols="2">SPORTS</cell><cell>183</cell></row><row><cell>2</cell><cell cols="2">boston marathon...</cell><cell cols="2">SPORTS</cell><cell>168</cell></row><row><cell>3</cell><cell>nfl rookie...</cell><cell></cell><cell cols="2">SPORTS</cell><cell>186</cell></row><row><cell>4</cell><cell>10 movies...</cell><cell></cell><cell cols="2">HOME &amp; LIVING</cell><cell>102</cell></row><row><cell>5</cell><cell cols="4">organic gardening... HOME &amp; LIVING</cell><cell>112</cell></row><row><cell>6</cell><cell cols="2">hiring a cleaning...</cell><cell cols="2">HOME &amp; LIVING</cell><cell>123</cell></row><row><cell>Table 1</cell><cell></cell><cell></cell><cell></cell></row><row><cell cols="2">Graphic representation of our database -first part</cell><cell></cell><cell></cell></row><row><cell cols="2">No. Food words</cell><cell cols="2">Sports words</cell><cell>Home words</cell></row><row><cell>1</cell><cell>2</cell><cell></cell><cell>22</cell><cell>2</cell></row><row><cell>2</cell><cell>0</cell><cell></cell><cell>14</cell><cell>1</cell></row><row><cell>3</cell><cell>0</cell><cell></cell><cell>21</cell><cell>0</cell></row><row><cell>4</cell><cell>1</cell><cell></cell><cell>2</cell><cell>14</cell></row><row><cell>5</cell><cell>3</cell><cell></cell><cell>2</cell><cell>12</cell></row><row><cell>6</cell><cell>2</cell><cell></cell><cell>0</cell><cell>13</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_0">𝑝𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑_𝑐𝑙𝑎𝑠𝑠 ← class with the highest score in 𝑐𝑙𝑎𝑠𝑠_𝑠𝑐𝑜𝑟𝑒𝑠;</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Combination of Naïve Bayes Classifier and K-Nearest Neighbor (cNK) in the Classification Based Predictive Models</title>
		<author>
			<persName><forename type="first">Elma</forename><forename type="middle">&amp;</forename><surname>Ferdousy</surname></persName>
		</author>
		<author>
			<persName><surname>Islam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Md &amp; Matin</surname></persName>
		</author>
		<idno type="DOI">6.10.5539/cis.v6n3p48</idno>
	</analytic>
	<monogr>
		<title level="j">Computer and Information Science</title>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A Comparison of Text Classification Methods k-NN, Naïve Bayes, and Support Vector Machine for News Classification</title>
		<author>
			<persName><forename type="first">Yohan</forename><surname>Muliono</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fidelson</forename><surname>Tanzil</surname></persName>
		</author>
		<idno type="DOI">10.30591/jpit.v3i2.828</idno>
	</analytic>
	<monogr>
		<title level="j">Jurnal Informatika: Jurnal Pengembangan IT</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="157" to="160" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">An effective refinement strategy for KNN text classifier</title>
		<author>
			<persName><forename type="first">Songbo</forename><surname>Tan</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.eswa.2005.07.019</idno>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="volume">30</biblScope>
			<biblScope unit="page" from="290" to="298" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Tackling the Poor Assumptions of Naive Bayes Text Classifiers</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D M</forename><surname>Rennie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Shih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Teevan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">R</forename><surname>Karger</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1973">1973. 2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Naive (Bayes) at forty: The independence assumption in information retrieval</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lewis</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ECML &apos;98</title>
				<meeting>ECML &apos;98</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">A re-examination of text categorization methods</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of SIGIR &apos;99</title>
				<meeting>SIGIR &apos;99</meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Text categorization with support vector machines: Learning with many relevant features</title>
		<author>
			<persName><forename type="first">T</forename><surname>Joachims</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ECML &apos;98</title>
				<meeting>ECML &apos;98</meeting>
		<imprint>
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Text categorization based on regularized linear classification methods</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">J</forename><surname>Oles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Retrieval</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="5" to="31" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A Survey of kNN Algorithm</title>
		<author>
			<persName><forename type="first">Jingwen</forename><forename type="middle">&amp;</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Weixing</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Niancai</forename><surname>Shi</surname></persName>
		</author>
		<idno type="DOI">.1.10.18063/ieac.v1i1.770</idno>
	</analytic>
	<monogr>
		<title level="j">Information Engineering and Applied Computing</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">AN EFFICIENT TEXT CLASSIFICATION USING KNN AND NAIVE BAYESIAN</title>
		<author>
			<persName><surname>Jayaprakash</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Sreemathy &amp; Balamurugan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal on Computer Science and Engineering</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="page" from="392" to="396" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">An Empirical Study of Distance Metrics for k-Nearest Neighbor Algorithm</title>
		<author>
			<persName><forename type="first">Kittipong</forename><surname>Chomboon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pasapitch</forename><surname>Chujai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pongsakorn</forename><surname>Teerarassammee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kittisak</forename><surname>Kerdprasop</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nittaya</forename><surname>Kerdprasop</surname></persName>
		</author>
		<idno type="DOI">10.12792/iciae2015.051</idno>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="280" to="285" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
