Text Classification*
Szymon Molitorys1,∗,†, Bartłomiej Piłot1,†, Kacper Smyrak1,†
1 Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice, Poland

*IVUS2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania
∗ Corresponding author.
† These authors contributed equally.
sm305126@student.polsl.pl (S. M., B. P., K. S.)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
In order to achieve our project objectives, we developed a text classifier using datasets derived from a collection of books. Each dataset comprises selected words along with their frequencies of occurrence. Our objective was to evaluate the performance of three different algorithms in classifying texts based on these datasets. To this end, we employed a decision tree, a Naive Bayes classifier, and a custom probability calculation algorithm to compare the accuracy and effectiveness of these methods. By testing our models on both larger and smaller datasets, we sought to understand the impact of dataset size on classification results. Our approach involved training the algorithms on a training dataset and then testing their performance on a separate test dataset to assess their predictive accuracy. The results demonstrated varying degrees of success across the algorithms, highlighting the strengths and weaknesses of each method in different scenarios. This comprehensive comparison provided valuable insights into the practical applications of text classification techniques and their potential for further refinement and use in real-world data analysis tasks. [1, 2, 3, 4, 5, 6]

Keywords
text classification, decision tree, Bayesian algorithm, custom probability algorithm, dataset comparison, word frequency analysis, machine learning, data analysis

1. Introduction

Data Preparation. We begin by loading the training and testing datasets using the pandas library. Each dataset contains three columns: "Word" (the word itself), "Frequency" (the number of times the word appears), and "Type" (the category to which the word belongs).

Decision Tree. The objective is to develop a text classifier using the Decision Tree algorithm: to train a model that can accurately predict the type of a given word based on its frequency. This approach leverages the structure of decision trees to handle the categorical nature of our data effectively.

2. Methodology

2.1. Decision Tree Algorithm

First, in our Decision Tree algorithm, words were mapped to integers. Since the Decision Tree algorithm requires numerical input, a dictionary was created that maps each unique word in the training dataset to a unique integer. This mapping allows us to convert the words into numerical format. Words in the testing dataset that do not appear in the training dataset are mapped to the value -1, ensuring that the model can handle unseen words without causing errors.

Both the training and testing datasets are split into features (X) and labels (y). The features are the word (now represented as an integer) and its frequency, while the labels are the types. The Decision Tree classifier is initialized with specific parameters:
• Criterion: the entropy criterion is employed to assess the quality of a split, utilizing information gain.
• Maximum depth: the maximum depth of the tree is tested with values of 5, 10, and 15 to observe the impact on model performance.
A minimal sketch of this setup is shown below.
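The following sketch illustrates the data preparation and Decision Tree setup described above. It assumes the column names from Section 1 and the dataset file names that appear later in the experiments (treningowa.csv, testowa1.csv); it is an illustrative reconstruction, not the authors' original code.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load the training and testing datasets (file names assumed from Section 3).
train = pd.read_csv("treningowa.csv")
test = pd.read_csv("testowa1.csv")

# Map every unique training word to an integer; unseen test words become -1.
word_to_int = {word: i for i, word in enumerate(train["Word"].unique())}
train["WordId"] = train["Word"].map(word_to_int)
test["WordId"] = test["Word"].map(word_to_int).fillna(-1).astype(int)

# Split into features (word id, frequency) and labels (type).
X_train, y_train = train[["WordId", "Frequency"]], train["Type"]
X_test, y_test = test[["WordId", "Frequency"]], test["Type"]

# Entropy criterion; the maximum depth is varied over 5, 10 and 15.
for depth in (5, 10, 15):
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=depth)
    clf.fit(X_train, y_train)
    print(depth, accuracy_score(y_test, clf.predict(X_test)))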
The accuracy of the model is evaluated using the accuracy_score function from scikit-learn, which compares the predicted types to the actual types in the testing dataset.

Mathematical entropy. Entropy is a measure of randomness or impurity in the data. It quantifies the uncertainty involved in predicting the class of a given dataset. In the context of decision trees, entropy is used to determine the best way to split the data at each node. The formula for entropy (H) is

    H = -\sum_i P(x_i) \log_2 P(x_i),

where P(x_i) is the probability of occurrence of class x_i. In our text classification problem, entropy helps us measure the impurity of our dataset with respect to the types of words. By minimizing entropy at each split, the decision tree aims to create the most informative branches, ultimately improving the model's classification accuracy.

2.2. Comparison of the number of words in the text

First we count the unique words in the text and then compare them with the words in the dataset. The category that shares the most words with the text is the most likely answer. There are many ways to perform this kind of comparison. We can simply increment a counter for each category every time we come across one of its words in the text, and then add up the number of occurrences. This approach has clear advantages: it is fast and reliable. However, it can be fooled by texts that use many specific words that are not related to the category. Moreover, the probability of a correct answer depends on the prepared dataset: preparing a good one is not a difficult task, but preparing the best one is very difficult. Therefore, we chose the less difficult method. There are other ways to compute a score for each category, but the best results are still obtained with neural networks, which usually require far more computation and were not permitted by our supervisor.

2.3. Naive Bayes Classifier using the Bag-of-Words method

For the Naive Bayes classifier we used the bag-of-words method to prepare the data for the analysis. The algorithm uses the training dataset as the basis for calculating the probabilities for the two test sets. The bag-of-words method allows us to use discrete values instead of continuous values by counting the occurrences of unique words in the text without regard to their alphabetical or word order (a short illustrative sketch of this counting step is shown below).

The goal of this algorithm is to classify a document by choosing the class C_j with the highest conditional probability for the chosen text x, which is given by formula (1), derived from Bayes' theorem:

    P(C_j | x) = \frac{P(x | C_j) P(C_j)}{P(x)}.    (1)

Here P(C_j) is the probability of choosing the class from all the classes, based on the number of documents of each class compared to the total number of documents, and P(x | C_j) is the likelihood of choosing text x given class C_j. The probability P(x) is the probability of choosing text x from all available texts; it is constant, because we are not manipulating the contents of the datasets, so it does not affect the results of our calculations and can be omitted. Consequently, the probability we want to calculate can be expressed as the joint probability

    P(x | C_j) P(C_j) = P(C_j, x).

Since we calculate the probability for each "feature" of the text, that is for each word, we can transform this equation accordingly.
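As a brief illustration of the bag-of-words counting step mentioned above, the sketch below reduces a document to word counts; the tokenizer and function name are assumptions made for the example, not part of the authors' implementation.

from collections import Counter
import re

def bag_of_words(text: str) -> Counter:
    """Reduce one document to a word -> occurrence-count mapping (word order is ignored)."""
    words = re.findall(r"[a-z]+", text.lower())  # simple tokenizer: lowercase alphabetic runs
    return Counter(words)

doc = "The match report: a great match, a memorable match."
print(bag_of_words(doc))  # e.g. Counter({'match': 3, 'a': 2, 'the': 1, ...})

In this sketch, per-class word counts could then be obtained by summing such counters over all training documents of a class.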
Moreover, since our classifier is "naive", we assume that the individual word probabilities are independent of one another. Substituting the text with its words x_1, x_2, ..., x_n, this independence assumption gives equation (2):

    P(C_j, x_1, x_2, \ldots, x_n) = P(C_j) \prod_{i=1}^{n} P(x_i | C_j).    (2)

Because of the insignificance of P(x) in equation (1), we calculate the probability of text x being of class C_j by computing P(C_j, x_1, x_2, ..., x_n) and choosing the class for which the highest value is achieved.

All three datasets are first loaded. The base training set is then grouped by the class of the document the words belong to and sorted in descending order of frequency. The test datasets are also sorted in descending order for each document. Sorting the datasets allows us to see the patterns in the occurrence of words for each text and text class more clearly.

To begin, we choose a text from the test dataset. Each of the texts will be checked against each of the available classes. After choosing the class for which we want to calculate the probability, we first calculate the probability of choosing each class, P(C_j) (denoted in code as Pclass), where j indicates the index of the class:

    P(C_j) = \frac{\text{number of documents of class } C_j}{\text{total number of documents}}.

In our case the number of documents of each class is equal and we are working with three possible classes, so P(C_j) = 1/3. This probability is the prior probability needed for calculating the main probability (2).

Next we need to find the conditional probabilities of the features (words) of the text, denoted by P(x_i | C_j) in equation (2). We calculate each probability P(x_i | C_j) using formula (3):

    P(x_i | C_j) = \frac{N_i}{N_{C,total}},    (3)

where N_i is the number of occurrences in the base dataset of the word from the test text (denoted in code as CountWordBase) and N_{C,total} is the total number of words in the base dataset belonging to the chosen class (denoted in code as CounttotalBase).

Although formula (3) is correct for each factor needed to calculate our main probability, a problem occurs. Since the test datasets and the base dataset may not contain the same set of words, the resulting probability for a word that does not occur in the base set is zero. Consequently, the value of P(C_j, x_1, x_2, ..., x_n) would also be zero, falsifying our results. To counter this, we use a modified version of formula (3) with Laplace smoothing. We increment the numerator by a value α (in this case we chose α = 1) and the denominator by α|V|, where |V| is the total number of unique words in the dataset (denoted in code as len(BaseClass)). The modified formula (3) then becomes

    P(x_i | C_j) = \frac{N_i + \alpha}{N_{C,total} + \alpha |V|}.    (4)

In code we implemented the +1 in the numerator by setting the default value of CountWordBase to 1.

The second problem we encountered is that for texts and base datasets containing more than 1000 words, the resulting probability P(C_j, x_1, x_2, ..., x_n) becomes a fraction of such a small order of magnitude that it cannot be stored using standard variable formats such as float or double. To counter this, we had to modify the formula for P(C_j, x_1, x_2, ..., x_n) so that it produces values that are storable and computable using these standard formats.
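A minimal sketch of how the smoothed estimate (4) can be combined into a classification score is given below. It already uses the logarithmic form motivated in the next paragraphs, and all names (log_posterior, classify, class_word_counts, class_totals, vocab_size) are assumptions made for the example; the authors' code uses CountWordBase, CounttotalBase and len(BaseClass) for the corresponding quantities.

import math

def log_posterior(text_words, word_counts, class_total, vocab_size, prior=1/3, alpha=1):
    """ln P(C_j) + sum_i ln P(x_i | C_j), with Laplace smoothing as in formula (4)."""
    score = math.log(prior)
    for word in text_words:
        count = word_counts.get(word, 0)  # N_i for this class; 0 if the word is unseen
        score += math.log((count + alpha) / (class_total + alpha * vocab_size))
    return score

def classify(text_words, class_word_counts, class_totals, vocab_size):
    """Return the class with the highest (least negative) log probability."""
    scores = {c: log_posterior(text_words, class_word_counts[c], class_totals[c], vocab_size)
              for c in class_word_counts}
    return max(scores, key=scores.get)  # argmax over the candidate classes

Working in the log domain keeps the scores in a range that a standard float or double can represent, and taking the maximum (least negative) score implements the decision rule derived next.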
To do this, we take the logarithm of both sides. Because of the product rule of logarithms, ln(ab) = ln(a) + ln(b), the multiplication can be converted into a sum. Because the conditional probabilities of the individual words lie in the range [0, 1], the log probabilities take values in (−∞, 0]. Despite the possibility of reaching large absolute values, the resulting log probabilities can be stored in float or double format thanks to their moderate order of magnitude (in our tests the log probabilities ranged between roughly −1000 and −10000), which solved our problem. The ordering of the values is also preserved: since ln(x) is a monotonically increasing function, greater probabilities yield less negative log probabilities. This means that we can use the log probabilities instead of the ordinary probabilities calculated from equation (2). Using equation (4) for P(x_i | C_j), the resulting equation for the log probability is

    \ln P(C_j, x_1, x_2, \ldots, x_n) = \ln P(C_j) + \sum_{i=1}^{n} \ln P(x_i | C_j).

After calculating the log probability for every available class of the chosen text, we choose the class with the highest (least negative) value as the class of the test text. From this probability model we can write the decision rule of our Naive Bayes classifier for the class ŷ of the test text:

    \hat{y} = \arg\max_{j} \left( \ln P(C_j) + \sum_{i=1}^{n} \ln P(x_i | C_j) \right).

3. Experiments

3.1. Dataset Description

We utilized three distinct datasets in our experiments: one training dataset and two test datasets. These datasets were constructed from the books and articles referenced in our study, with the words counted using a specially prepared Python script. Each dataset comprises between 2000 and 5000 rows of data, and the books were evenly distributed among the categories. In our database, each entry is characterized by the word itself, its frequency of occurrence, and its category. This structured approach allowed us to conduct thorough experiments on our algorithms' performance across different datasets and categories.

3.2. Experimentation with Max Depth in the Decision Tree

To find the optimal complexity of our Decision Tree model, we experimented with different values for the max_depth parameter: 5, 10, and 15. This parameter controls the maximum depth of the tree, influencing how well the model generalizes to unseen data. We approached the experiment with the hypothesis that:
• Max depth 5: with a shallow tree, the model may underfit, failing to capture the complexity of the data.
• Max depth 10: this depth is a middle ground, potentially balancing the trade-off between underfitting and overfitting.
• Max depth 15: a deeper tree may overfit the training data, capturing noise rather than the underlying pattern.

Figure 1: Decision Tree Comparison

The results turned out surprisingly different: the behaviour depended on the dataset. When working with a larger dataset, it is generally beneficial to use a greater maximum depth for the decision tree, as it allows the model to capture more complex patterns in the data. Conversely, for smaller datasets, a shallower maximum depth is preferable to avoid overfitting and to ensure the model generalizes well to unseen data.

3.3. Experimentation with Word Occurrence

For our experimental data, both of these algorithms gave the correct answer every time. This means that the dataset is close enough to the tests we provide.
Running time increases when working with larger datasets, so good initial data is important; testing with more examples is still needed.

3.4. Experimentation with the Naive Bayes Classifier

To correctly assess the class of each of the texts in the test sets, we used the training set as the base for calculating probabilities and assigning classes to the texts in test set 1 and test set 2. Additionally, since the test datasets also contain the information on what the classes of the test texts are, we could use test sets 1 and 2 as a base instead of the training set. Below is an example of the calculations for one pair of datasets on which we ran our program.

Table 1
Example of values for classification for base dataset "treningowa.csv" and test dataset "testowa1.csv"

Text class    Assumed class    Log probability value
History       History          -2635.23
History       sport            -6616.30
History       cooking          -7078.77
sport         History          -3599.93
sport         sport            -1415.50
sport         cooking          -3536.01
cooking       History          -3416.40
cooking       sport            -3136.54
cooking       cooking          -1219.84

The column labelled "Assumed class" contains the candidate class for which the probability was calculated, and the column labelled "Text class" contains the true class of the text. As we can see, the most probable class for each text has the highest (least negative) value. In this case the algorithm calculated the highest value for the correct class of each text, resulting in 100% accuracy. The algorithm creates tables such as this example and then uses the argmax function to choose the result with the highest probability. Afterwards we can calculate the accuracy of the algorithm for each pair of datasets. The accuracy results of the algorithm running on the dataset pairs specified before are as follows.

Table 2
Accuracy of the Naive Bayes classifier for pairs of datasets

Base dataset name    Test dataset name    Accuracy
treningowa           treningowa           100%
treningowa           testowa1             100%
treningowa           testowa2             100%
testowa1             testowa2             100%
testowa2             testowa1             100%

Based on the accuracy values presented in Table 2, we can see that the Naive Bayes classifier correctly guessed the class of every text in the datasets. Therefore, for the current number of datasets used in the experiment, the accuracy of our algorithm is 100%. Moreover, for each pair of datasets the log probability values of the correct guesses were significantly greater (less negative, differing by a factor of more than two in most cases) than those of the incorrect guesses, as seen in Table 1.

4. Conclusion

Looking at the decision tree, the objective of this study was to develop an effective model for text classification based on word frequency. To this end, we implemented a Decision Tree classifier and conducted experiments with different tree depths. Based on the results of the experiments shown in Figure 1, the decision tree algorithm is reasonably accurate for text classification. Although the classification results vary depending on the test article and the depth value, the algorithm can reach a satisfying accuracy of over 50% when given the right depth value for the decision tree. Moreover, the lowest accuracy for this algorithm does not fall below 30% even for unadjusted depth values, meaning that even when the algorithm is not tuned for certain types of articles, it retains some ability to assign the right class to the right article.
Since the results vary, we can also speculate that for different types of training sets and different values of the decision tree depth the algorithm could reach higher accuracy. Nevertheless, such adjustments may be dataset-specific: for the testowa2 dataset the algorithm shows a decreasing, roughly linear trend in accuracy with increasing tree depth, while for testowa1 the accuracy shows an increasing, roughly linear trend with increasing tree depth, as seen in Figure 1. This may make finding the settings that give the highest accuracy difficult.

The second algorithm that we used, the Naive Bayes classifier, shows a very high accuracy of 100% in classifying the documents for every checked combination of the treningowa, testowa1 and testowa2 datasets, as shown in Table 2. For each test text of the possible classes History, sport and cooking, the algorithm correctly guessed the class of the test article. The high accuracy and the correctness of the calculations of the algorithm are supported by the results shown in Table 1. In that table the values of the log probabilities for the correctly assumed classes are significantly higher (less negative, by roughly a factor of two) than for the incorrect guesses. This difference in values shows that the algorithm correctly calculates the probability for different texts that vary in vocabulary range, word count and topic; the possibility that the algorithm chose the correct class due to a random error in the calculations or due to a very small difference in probability can be ruled out.

Re-evaluating our results, we can see that the best algorithm for text classification is the Naive Bayes classifier: it showed the highest accuracy of our algorithms during the experiment. Nevertheless, the Decision Tree algorithm also showed potential for high classification accuracy, given the trends of the accuracy depending on the settings. Nonetheless, the accuracy results for this algorithm were significantly lower than for the Naive Bayes classifier: the highest accuracy reached by the Decision Tree algorithm was 52%, which is about half of the accuracy of the Naive Bayes classifier. The Decision Tree algorithm, however, shows potential for improvement.

The experiment could be improved by using bigger datasets with a larger number of articles. Moreover, the datasets could include more types of classes, such as science, medicine or entertainment, or, in the case of longer texts such as extracts from books, book genres such as fantasy, sci-fi, etc. This could show us more accurately the behaviour of our algorithms when given more input data. What is more, the greater amount of information could reveal possible anomalies and outliers as well as trends in the results, allowing us to improve our algorithms accordingly. As for the Decision Tree algorithm, we could check the classification results for higher values of the maximum tree depth. This could show us whether our assumptions about the linear trend of the classification accuracy were correct, giving us information on how to improve the algorithm.

References

The datasets utilized in this project were derived from a collection of books and various online sources. For reference and further exploration, the links to these books and articles are provided in the bibliography section below. These resources form the foundation of our text classification study, offering a diverse range of textual data essential for our analysis.
• History Of The Decline And Fall Of The Roman Empire, Edward Gibbon, Esq. With notes by the Rev. H. H. Milman. Complete Contents, Chapter XXI.
• The European Experience: A Multi-Perspective History of Modern Europe, 1500–2000.
• The Most Amazing Chocolate Chip Cookies.
• Liverpool vs. Tottenham Premier League Match Report.
• Homemade Pizza Recipe.
• NASCAR Cup Series at Kansas Speedway.
• The Project Gutenberg eBook of Cleopatra, Chapter 1.
• What the German Conscript Thinks by Arnold Bennett. Copyright, 1914, by The New York Times Company.
• The History of Christianity by John S. C. Abbott, Chapter XIII.
• How to Make a Blueberry Pie.
• Rice Cooker Asian Chicken Rice, Grilled Duck Breast With Morels.
• Kylian Mbappé on Final Season at PSG.
• Arsenal Boss Mikel Arteta on the ‘Painful’ Chase of Manchester City.
• How ‘Introverted’ Iga Świątek Became a Four-Time Grand Slam Champion.

[1] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu, "Text Classification by Labeling Words," Proceedings of the 19th National Conference on Artificial Intelligence (AAAI-04), San Jose, California, USA, July 25-29, 2004.
[2] S. Kotsiantis, D. Kanellopoulos, and P. Pintelas, "Handling imbalanced datasets: A review," GESTS International Transactions on Computer Science and Engineering, vol. 30, no. 1, pp. 25-36, 2006.
[3] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proceedings of the 10th European Conference on Machine Learning (ECML-98), Chemnitz, Germany, April 21-23, 1998.
[4] K. Nigam, A. K. McCallum, S. Thrun, and T. Mitchell, "Text Classification from Labeled and Unlabeled Documents using EM," Machine Learning, vol. 39, no. 2/3, pp. 103-134, 2000.
[5] J. R. Quinlan, "Induction of Decision Trees," Machine Learning, vol. 1, pp. 81-106, 1986.
[6] D. Mladenic and M. Grobelnik, "Feature Selection for Unbalanced Class Distribution and Naive Bayes," Proceedings of the 16th International Conference on Machine Learning (ICML-99), Bled, Slovenia, June 27-30, 1999.