Comparison of classifiers for lung cancer prediction*
                                Kamil Jędrzkiewicz1,∗,†, Adam Kaszubowski1,† and Mateusz Goik1,†
                                1
                                Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44100 Gliwice, POLAND


                                                Abstract
                                                In this article, we present the program we have developed for lung cancer detection. For making predic-
                                                tions, it uses comprehensive patient information, including gender, age, smoking habits, yellow fingers,
                                                anxiety, peer pressure, chronic disease, fatigue, presence of allergies, wheezing, alcohol consumption,
                                                coughing, shortness of breath, difficulty swallowing, and chest pain. We start by providing a thorough
                                                analysis of the database to identify which features have the most significant impact on the likelihood of
                                                developing lung cancer. This includes statistical evaluations and visualizations to better understand the
                                                data distribution and correlations between various attributes and lung cancer incidence. Next, we present the
                                                results of implementing several different classifiers on the dataset. Through this comparative analysis, we
                                                demonstrate that, after preliminary tests, the naive Bayes algorithm emerges as the most effective
                                                classifier. We provide the pseudocode for the naive Bayes algorithm, offering a clear and accessible
                                                explanation of its implementation. Additionally, we conduct a detailed analysis of its effectiveness,
                                                supported by charts and graphs that illustrate the algorithm’s accuracy and other relevant performance
                                                metrics. Furthermore, we highlight the process of feature selection. By removing irrelevant from the
                                                database, we are able to enhance the program’s speed and accuracy.
                                                Keywords
                                                Lung cancer, Disease detection, Naive Bayes algorithm, Healthcare


                                1. Introduction
                                Lung cancer remains one of the most lethal forms of cancer worldwide[1]. It is difficult to
                                detect in its early stages because its symptoms are very subtle[2]. Fortunately, thanks to the
                                advancements in machine learning algorithms we are now able to improve early detection and
                                diagnosis of this disease to improve patient outcomes. This approach has already worked well with
                                several other kinds of sicknesses such as heart diseases[3], diabetes[4], prostate cancer[5] and breast
                                cancer[6]. In this article, we introduce a cutting-edge program developed for the detection of
                                lung cancer, leveraging the capabilities of machine learning. Utilizing a wide range of patient
                                information—such as gender, age, smoking habits, and other health indicators. Our program
                                employs a naive Bayes algorithm to predict the likelihood of lung cancer with notable accuracy.
                                   This study provides an in-depth analysis of the data features that significantly influence lung
                                cancer risk, offering insights into their relevance and impact. We compare the performance of
                                various classifiers and demonstrate why the naive Bayes algorithm stands out as the most
                                effective after initial testing.[7] Detailed pseudo-code and performance metrics are presented to elucidate
                                the algorithm’s efficiency and robustness.


                                *IVUS2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania
                                1,∗
                                      Corresponding author
                                †
CEUR
                  ceur-ws.org
                                      These author contributed equally.
Workshop
Proceedings
              ISSN 1613-0073
                                       kj307872@student.polsl.pl (K. Jędrzkiewicz); adamkas324@student.polsl.pl (A. Kaszubowski); mg307866@student.polsl.pl (M. Goik)
                                       0000-0000-0000-0000 (K. Jędrzkiewicz); 0000-0000-0000-0000 (A. Kaszubowski); 0000-0000-0000-0000 (M. Goik)
                                               ©️ 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
   Furthermore, we explore the process of refining the dataset by eliminating unnecessary
information, which enhances both the speed and accuracy of the predictions. This article
not only showcases the technical aspects of our program but also emphasizes its potential to
revolutionize lung cancer diagnosis, offering a valuable tool for healthcare professionals in the
fight against this devastating disease.

2. Methodology
In order to choose the classifier that best suits our task, we have conducted a test of three
popular algorithms: k nearest neighbours classifier, naive Bayes classifier and decision tree
classifier. Each of the algorithms has been run 500 times, each time with random training and
test dataset. Then, we have calculated mean accuracy for all of classifiers and compared their
results.

2.1. KNN Classifier
KNN (k-nearest neighbors) is one of the most basic and popular classification algorithm. It
measures the distance between the new sample and all points in the training set, identifies the K
nearest neighbors, and assigns the most common class label among these neighbors to the new
sample.[8] In our project, we used the Euclidean metric to calculate the distance.
We tested for k=2,3,4,5,6,7 and the best was k = 3 and k = 5, with k = 2 being by far the worst. 2


  where:

    • 𝑥𝑖: The 𝑖-th coordinate of the point x.
    • 𝑦𝑖: The 𝑖-th coordinate of the point y.

2.2. Naive Bayes Classifier
The Gaussian Naive Bayes classifier works by classifying a sample based on the probabilities of
each class given the feature values, assuming that features follow a Gaussian (normal)
distribution. It calculates the likelihood of the sample’s features for each class, combines these with
the prior probabilities of the classes, and assigns the class with the highest resulting
probability to the sample.
We decided on the Gaussian Naive Bayes because it had the highest efficiency 2
The formula for the conditional probability of a feature 𝑥𝑖 given class 𝑦 is:
     where:

      • 𝑃 (𝑥𝑖|𝑦): The conditional probability of feature 𝑥𝑖 given class 𝑦.
      • 𝜎𝑖,𝑦: The standard deviation of feature 𝑥𝑖 in class 𝑦. It measures the spread of the feature
        values around the mean.
      • 𝜇𝑖,𝑦: The mean (average) of feature 𝑥𝑖 in class 𝑦. It represents the central value of the
        feature for the given class.
      • 𝑥𝑖: The value of the i-th feature.
      • 𝑦: The class label.


 Algorithm 1: Gaussian Naive Bayes
  Data: training data, object to classify
  Result: class to which the object belongs
1 groups = split training data into groups according to their class;
2 best_class = "";
3 best_score = 0;
4 for group in groups do
5     score = log(number of rows in group/number of rows in all training data);
6     for column in group do
7         std = standard deviation for column;
8         mean = mean for column;
9         x = value of column from object to classify;
10            col_score =
11            score +=
12      if score > best_score then
13           best_score = score;
14           best_class = class of group group;
15 return best_class


2.3. Decision Tree Classifier
The Decision Tree classifier works by recursively splitting the dataset into subsets based on
feature values, creating a tree structure where each node represents a feature and each branch
represents a decision rule. It continues splitting until the subsets are as pure as possible, meaning they
contain samples predominantly from one class. The class label assigned to a new sample is
determined by traversing the tree according to the sample’s feature values until reaching a leaf
node, which represents the predicted class. [9]
3. Experiments
3.1. Dataset Description
Our database consists of 16 columns and 309 rows. Individual information includes information about
the patient such as gender, age, smoking, yellow fingers, anxiety, peer pressure, chronic disease,
fatigue, allergy, wheezing, alcohol consuming, coughing, shortness of breath, swallow- ing
difficulty, chest pain and lung cancer, which tells us whether the person has cancer. A value of 1
means that the patient does not have a given symptom and 2 means that he does.
We made a correlation matrix. We were most interested in the last row to find out which
symptoms have a positive correlation with lung cancer. From it we can conclude that smoking and
shortness of breath have the lowest correlation (but still positive) while allergy and alcohol
consuming have the highest correlation.


Figure 1: Correlation matrix
3.2. Testing
We compared 4 classifiers to check which one would work best for our data. We used K Nearest
Neighbours, Decision Tree, Gaussian Naive Bayes and Multinomial Naive Bayes. As you can
see in the figure 2, the Gaussian Naive Bayes has the highest accuracy.


Figure 2: Algorithms accuracy


We then removed one column and checked how its removal would affect the accuracy of the
classifier. The differences were negligible, so we decided to remove several columns at once.
The best results were obtained after removing columns such as: ’WHEEZING’, ’SWALLOWING
DIFFICULTY’, ’AGE’, ’COUGHING’, ’SMOKING’ where the accuracy of the model averaged
91.28%, and for the best sample of 500 was 100%


Figure 3: Average results after removal of a given symptom
3.3. Results Analysis
We also created an error matrix for each classifier and calculated: Accuracy, Recall, Precision, F1
and Specificity[10].
  The following values were calculated from the formulas:

    • Accuracy - determines what part of all classified texts was classified correctly


    • Recall - determines the share of correctly predicted positive cases (TP) among all positive cases


    • Precision - determines how many of the examples predicted positively are actually positive


    • F1 - is the harmonic mean between precision and recall. The closer it is to one, the better it
      proves about the classification algorithm.


    • Specificity - determines how often the model accurately predicted falsehood when some-
      thing was actually false


The meaning of symbols:

    • TP - the sick person was correctly classified
    • TN - a healthy person has been correctly classified
    • FP - the sick person was classified as healthy
    • FN - a healthy person has been classified as sick


Table 1
Analyze Results
 Classifier                   Accuracy     Recall    Precision     F1     Specificity
 KNN(5)                          0.89        0.98        0.90      0.94       0.18
 Gaussian Naive Bayes            0.90        0.93        0.94      0.94       0.55
 Multinomial Naive Bayes         0.88        1.00        0.88      0.94       0.00
 Decision Tree                   0.88        0.93        0.94      0.94       0.55
Figure 4: Confusion matrix


4. Conclusion
In conclusion, our study presents a novel approach to lung cancer detection through the
integration of machine learning algorithms and comprehensive patient data analysis.
Our research highlights the importance of feature selection in optimizing algorithm performance,
leading to improved prediction accuracy and efficiency. Through comparative analysis and
detailed evaluation, we have demonstrated the superiority of the naive Bayes algorithm in this
context.
By facilitating early detection and intervention, our approach has the potential to significantly
improve patient outcomes and contribute to the ongoing efforts to combat this deadly disease.
References
 [1] J. Malhotra, M. Malvezzi, E. Negri, C. La Vecchia, P. Boffetta, Risk factors for lung cancer
     worldwide, European Respiratory Journal 48 (2016) 889–902.
 [2] R. L. Krech, J. Davis, D. Walsh, E. B. Curtis,                         Symptoms of lung
      cancer,            Palliative Medicine 6 (1992) 309–315. URL: https://doi.
      org/10.1177/026921639200600406.                    doi:10.1177/026921639200600406.
      arXiv:https://doi.org/10.1177/026921639200600406.
 [3] H. Arghandabi, P. Shams, A comparative study of machine learning algorithms for the
      prediction of heart disease, International Journal for Research in Applied Science and
      Engineering Technology 8 (2020) 677–683.
 [4] A. Mujumdar, V. Vaidehi, Diabetes prediction using machine learning algorithms, Proce-
      dia Computer Science 165 (2019) 292–299. URL: https://www.sciencedirect.com/science/
      article/pii/S1877050920300557. doi:https://doi.org/10.1016/j.procs.2020.01. 047, 2nd
      International Conference on Recent Trends in Advanced Computing ICRTAC
     -DISRUP - TIV INNOVATION , 2019 November 11-12, 2019.
 [5] M. M. I. Molla, J. Jui, H. Rana, N. Podder, Machine Learning Algorithms for the
      Prediction of Prostate Cancer, 2023, pp. 471–482. doi:10.1007/978-981-19-7528-8_37.
 [6] M. Amrane, S. Oukid, I. Gagaoua, T. Ensarİ, Breast cancer classification using machine
      learning, in: 2018 Electric Electronics, Computer Science, Biomedical Engineerings’
      Meeting (EBBT), 2018, pp. 1–4. doi:10.1109/EBBT.2018.8391453.
 [7] E. M. E. F. Christian Dwi Suhendra, Effan Najwaini, A machine learning perspective on
      daisy and dandelion classification: Gaussian naive bayes with sobel, Indonesian Journal of
      Data and Science 4 (2023) 151–159.
 [8] X. Mu, Implementation of music genre classifier using knn algorithm, Highlights in
      Science Engineering and Technology 34 (2023) 149–154.
 [9] V. V. Karnika Dwivedi, Hari Om Sharan, Analysis of decision tree for diabetes prediction,
      International Journal of Engineering and Technical Research (IJETR) 9 (2019) 3–6.
[10] B. Juba, H. S. Le, Precision-recall versus accuracy and the role of large data sets, in:
     Proceedings of the AAAI conference on artificial intelligence, volume 33, 2019, pp. 4039–
     4048.