<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparison of classifiers for lung cancer prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kamil Jędrzkiewicz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adam Kaszubowski</string-name>
          <email>adamkas324@student.polsl.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mateusz Goik</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>Kaszubska 23, 44100 Gliwice</addr-line>
          ,
          <country country="PL">POLAND</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IVUS2024: Information Society and University Studies 2024</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this article, we present the program we have developed for lung cancer detection. For making predictions, it uses comprehensive patient information, including gender, age, smoking habits, yellow fingers, anxiety, peer pressure, chronic disease, fatigue, presence of allergies, wheezing, alcohol consumption, coughing, shortness of breath, difficulty swallowing, and chest pain. We start by providing a thorough analysis of the database to identify which features have the most significant impact on the likelihood of developing lung cancer. This includes statistical evaluations and visualizations to better understand the data distribution and correlations between various attributes and lung cancer incidence. Next, we present the results of implementing several different classifiers on the dataset. Through this comparative analysis, we demonstrate that, after preliminary tests, the naive Bayes algorithm emerges as the most effective classifier. We provide the pseudocode for the naive Bayes algorithm, offering a clear and accessible explanation of its implementation. Additionally, we conduct a detailed analysis of its effectiveness, supported by charts and graphs that illustrate the algorithm's accuracy and other relevant performance metrics. Furthermore, we highlight the process of feature selection. By removing irrelevant features from the database, we are able to enhance the program's speed and accuracy.</p>
      </abstract>
      <kwd-group>
        <kwd>Lung cancer</kwd>
        <kwd>Disease detection</kwd>
        <kwd>Naive Bayes algorithm</kwd>
        <kwd>Healthcare</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Lung cancer remains one of the most lethal forms of cancer worldwide[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. It is difficult to
detect in its early stages because its symptoms are very subtle[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Fortunately, thanks to
advancements in machine learning algorithms, we are now able to improve early detection and
diagnosis of this disease, and thereby patient outcomes. This approach has already worked well with
several other diseases, such as heart disease[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], diabetes[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], prostate cancer[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and breast
cancer[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In this article, we introduce a program developed for the detection of
lung cancer that leverages the capabilities of machine learning. Utilizing a wide range of patient
information, such as gender, age, smoking habits, and other health indicators, our program
employs a naive Bayes algorithm to predict the likelihood of lung cancer with notable accuracy.
      </p>
      <p>
        This study provides an in-depth analysis of the data features that significantly influence lung
cancer risk, offering insights into their relevance and impact. We compare the performance of
various classifiers and demonstrate why the naive Bayes algorithm stands out as the most
effective after initial testing [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Detailed pseudocode and performance metrics are presented to elucidate
the algorithm’s efficiency and robustness.
      </p>
      <p>Furthermore, we explore the process of refining the dataset by eliminating unnecessary
information, which enhances both the speed and accuracy of the predictions. This article
not only showcases the technical aspects of our program but also emphasizes its potential to
revolutionize lung cancer diagnosis, offering a valuable tool for healthcare professionals in the
fight against this devastating disease.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>In order to choose the classifier that best suits our task, we conducted a test of three
popular algorithms: the k-nearest neighbours classifier, the naive Bayes classifier and the decision tree
classifier. Each of the algorithms was run 500 times, each time with a random training and
test split. We then calculated the mean accuracy for each classifier and compared the
results.</p>
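      <p>The repeated-split protocol described above can be sketched in Python. This is a minimal illustration, not the original program: the function names are ours, and the placeholder majority-class predictor merely stands in for the classifiers actually being compared.</p>

```python
import random
from statistics import mean

def evaluate(classifier_fn, data, labels, runs=500, test_fraction=0.3, seed=42):
    """Run classifier_fn on many random train/test splits and return mean accuracy."""
    rng = random.Random(seed)
    accuracies = []
    indices = list(range(len(data)))
    for _ in range(runs):
        rng.shuffle(indices)
        cut = int(len(data) * (1 - test_fraction))
        train_idx, test_idx = indices[:cut], indices[cut:]
        train_X = [data[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        correct = sum(
            1 for i in test_idx
            if classifier_fn(train_X, train_y, data[i]) == labels[i]
        )
        accuracies.append(correct / len(test_idx))
    return mean(accuracies)

def majority_class(train_X, train_y, sample):
    """Placeholder classifier: always predicts the majority training class."""
    return max(set(train_y), key=train_y.count)
```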
      <sec id="sec-2-1">
        <title>2.1. KNN Classifier</title>
        <p>
          KNN (k-nearest neighbors) is one of the most basic and popular classification algorithms. It
measures the distance between the new sample and all points in the training set, identifies the k
nearest neighbors, and assigns the most common class label among these neighbors to the new
sample [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In our project, we used the Euclidean metric to calculate the distance.
We tested k = 2, 3, 4, 5, 6, 7; the best results were obtained for k = 3 and k = 5, with k = 2 being by far the worst. The Euclidean distance between two points x and y is defined as d(x, y) = sqrt(sum_i (x_i - y_i)^2), where:
        </p>
        <p>• x_i: the i-th coordinate of the point x.</p>
        <p>• y_i: the i-th coordinate of the point y.</p>
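        <p>A minimal Python sketch of this procedure (function names and data layout are our own assumptions, not taken from the original program):</p>

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(train_X, train_y, sample, k=3):
    """Classify sample by majority vote among its k nearest training points."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: euclidean(pair[0], sample))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]
```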
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Naive Bayes Classifier</title>
        <p>The Gaussian Naive Bayes classifier works by classifying a sample based on the probabilities of
each class given the feature values, assuming that features follow a Gaussian (normal)
distribution. It calculates the likelihood of the sample’s features for each class, combines these with
the prior probabilities of the classes, and assigns the class with the highest resulting
probability to the sample.</p>
        <p>We decided on the Gaussian Naive Bayes because it had the highest accuracy in our preliminary tests.
The formula for the conditional probability of a feature x_i given class y is
P(x_i | y) = (1 / sqrt(2 * pi * sigma_{i,y}^2)) * exp(-(x_i - mu_{i,y})^2 / (2 * sigma_{i,y}^2)), where:
• P(x_i | y): the conditional probability of feature x_i given class y.
• sigma_{i,y}: the standard deviation of feature x_i in class y. It measures the spread of the feature
values around the mean.
• mu_{i,y}: the mean (average) of feature x_i in class y. It represents the central value of the
feature for the given class.
• x_i: the value of the i-th feature.</p>
        <p>• y: the class label.</p>
        <sec id="sec-2-2-1">
          <title>Algorithm 1: Gaussian Naive Bayes</title>
          <p>Data: training data, object to classify</p>
          <p>Result: class to which the object belongs
1 groups = split training data into groups according to their class;
2 best_class = "";
3 best_score = -infinity;
4 for group in groups do
5   score = log(number of rows in group / number of rows in all training data);
6   for column in group do
7     std = standard deviation for column;
8     mean = mean for column;
9     x = value of column from object to classify;
10    score = score + log(P(x | group's class)), using the Gaussian density with mean and std;
11  if score > best_score then best_class = group's class; best_score = score;
12 return best_class;</p>
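          <p>A runnable Python sketch of Algorithm 1, working in log space to avoid numerical underflow. The helper names and the zero-variance guard are our additions, not part of the original program.</p>

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def gaussian_log_pdf(x, mu, sigma):
    """Log of the Gaussian density N(mu, sigma^2) evaluated at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def gnb_predict(train_X, train_y, sample):
    """Gaussian Naive Bayes: pick the class maximizing
    log prior + sum of per-feature Gaussian log-likelihoods."""
    groups = defaultdict(list)
    for row, label in zip(train_X, train_y):
        groups[label].append(row)
    best_class, best_score = None, -math.inf
    total = len(train_X)
    for label, rows in groups.items():
        score = math.log(len(rows) / total)      # log prior of this class
        for j in range(len(sample)):
            column = [row[j] for row in rows]
            mu = mean(column)
            sigma = stdev(column) or 1e-9        # guard against zero variance
            score += gaussian_log_pdf(sample[j], mu, sigma)
        if score > best_score:
            best_class, best_score = label, score
    return best_class
```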
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Decision Tree Classifier</title>
        <p>
          The Decision Tree classifier works by recursively splitting the dataset into subsets based on
feature values, creating a tree structure where each node represents a feature and each branch
represents a decision rule. It continues splitting until the subsets are as pure as possible, meaning they
contain samples predominantly from one class. The class label assigned to a new sample is
determined by traversing the tree according to the sample’s feature values until reaching a leaf
node, which represents the predicted class. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
        </p>
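        <p>The recursive splitting described above can be sketched as a minimal Gini-impurity decision tree in Python. This is a simplified illustration with names of our choosing; production implementations are considerably more elaborate.</p>

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(X, y, depth=0, max_depth=5):
    """Recursively split on the (feature, threshold) pair minimizing the
    weighted Gini impurity of the children; stop when pure or too deep."""
    if len(set(y)) == 1 or depth == max_depth:
        return Counter(y).most_common(1)[0][0]   # leaf: majority class
    best = None  # (impurity, feature, threshold, left_idx, right_idx)
    for j in range(len(X[0])):
        for t in sorted(set(row[j] for row in X)):
            left = [i for i in range(len(X)) if X[i][j] <= t]
            right = [i for i in range(len(X)) if X[i][j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini([y[i] for i in left]) +
                     len(right) * gini([y[i] for i in right])) / len(X)
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        return Counter(y).most_common(1)[0][0]
    _, j, t, left, right = best
    return (j, t,
            build_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth),
            build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth))

def tree_predict(node, sample):
    """Walk the tree until a leaf (a bare class label) is reached."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if sample[j] <= t else right
    return node
```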
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset Description</title>
        <p>Our database consists of 16 columns and 309 rows. Each row contains information about
a patient: gender, age, smoking, yellow fingers, anxiety, peer pressure, chronic disease,
fatigue, allergy, wheezing, alcohol consuming, coughing, shortness of breath, swallowing
difficulty, chest pain and lung cancer, which tells us whether the person has cancer. A value of 1
means that the patient does not have a given symptom and 2 means that he does.
We made a correlation matrix. We were most interested in its last row, to find out which
symptoms have a positive correlation with lung cancer. From it we can conclude that smoking and
shortness of breath have the lowest (but still positive) correlation, while allergy and alcohol
consuming have the highest correlation.</p>
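        <p>The last row of such a correlation matrix is simply the correlation of each symptom column with the lung cancer column. A Python sketch, assuming the standard Pearson coefficient (the column names below are illustrative, not the dataset's exact headers):</p>

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def cancer_correlations(columns, cancer_column):
    """Correlation of every symptom column with the lung cancer column."""
    return {name: pearson(values, cancer_column)
            for name, values in columns.items()}
```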
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Testing</title>
        <p>We compared 4 classifiers to check which one would work best for our data: K Nearest
Neighbours, Decision Tree, Gaussian Naive Bayes and Multinomial Naive Bayes. As can be
seen in Figure 2, the Gaussian Naive Bayes has the highest accuracy.
We then removed one column at a time and checked how its removal affected the accuracy of the
classifier. The differences were negligible, so we decided to remove several columns at once.
The best results were obtained after removing the columns 'WHEEZING', 'SWALLOWING
DIFFICULTY', 'AGE', 'COUGHING' and 'SMOKING', where the accuracy of the model averaged
91.28%, and the best of the 500 runs reached 100%.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Results Analysis</title>
        <p>
          We also created an error matrix for each classifier and calculated: Accuracy, Recall, Precision, F1
and Specificity[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>The following values were calculated from the formulas:
• Accuracy - determines what part of all classified samples was classified correctly
• Recall - determines the share of correctly predicted positive cases (TP) among all actual positive cases
• Precision - determines how many of the examples predicted positive are actually positive
• F1 - the harmonic mean of precision and recall; the closer it is to one, the better the
classification algorithm performs
• Specificity - determines how often the model correctly predicted a negative when the
case was actually negative</p>
        <sec id="sec-3-3-1">
          <title>The meaning of symbols:</title>
          <p>• TP - a sick person was correctly classified as sick
• TN - a healthy person was correctly classified as healthy
• FP - a healthy person was classified as sick
• FN - a sick person was classified as healthy</p>
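          <p>The five metrics follow directly from these four counts; a minimal Python sketch (the function name is ours):</p>

```python
def metrics(tp, tn, fp, fn):
    """Compute the five evaluation metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total            # fraction classified correctly
    recall = tp / (tp + fn)                 # TP among all actual positives
    precision = tp / (tp + fp)              # TP among all predicted positives
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    specificity = tn / (tn + fp)            # TN among all actual negatives
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1": f1, "specificity": specificity}
```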
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In conclusion, our study presents a novel approach to lung cancer detection through the
integration of machine learning algorithms and comprehensive patient data analysis.
Our research highlights the importance of feature selection in optimizing algorithm performance,
leading to improved prediction accuracy and efficiency. Through comparative analysis and
detailed evaluation, we have demonstrated the superiority of the naive Bayes algorithm in this
context.</p>
      <p>By facilitating early detection and intervention, our approach has the potential to significantly
improve patient outcomes and contribute to the ongoing efforts to combat this deadly disease.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Malhotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Malvezzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Negri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>La Vecchia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Boffetta</surname>
          </string-name>
          ,
          <article-title>Risk factors for lung cancer worldwide</article-title>
          ,
          <source>European Respiratory Journal</source>
          <volume>48</volume>
          (
          <year>2016</year>
          )
          <fpage>889</fpage>
          -
          <lpage>902</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Krech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Walsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. B.</given-names>
            <surname>Curtis</surname>
          </string-name>
          ,
          <article-title>Symptoms of lung cancer</article-title>
          ,
          <source>Palliative Medicine</source>
          <volume>6</volume>
          (
          <year>1992</year>
          )
          <fpage>309</fpage>
          -
          <lpage>315</lpage>
          . URL: https://doi.org/10.1177/026921639200600406. doi:10.1177/026921639200600406.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Arghandabi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shams</surname>
          </string-name>
          ,
          <article-title>A comparative study of machine learning algorithms for the prediction of heart disease</article-title>
          ,
          <source>International Journal for Research in Applied Science and Engineering Technology</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>677</fpage>
          -
          <lpage>683</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mujumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vaidehi</surname>
          </string-name>
          ,
          <article-title>Diabetes prediction using machine learning algorithms</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>165</volume>
          (
          <year>2019</year>
          )
          <fpage>292</fpage>
          -
          <lpage>299</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1877050920300557. doi:10.1016/j.procs.2020.01.047. 2nd International Conference on Recent Trends in Advanced Computing (ICRTAC - DISRUPTIVE INNOVATION), November 11-12, 2019.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. M. I.</given-names>
            <surname>Molla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Podder</surname>
          </string-name>
          ,
          <source>Machine Learning Algorithms for the Prediction of Prostate Cancer</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>471</fpage>
          -
          <lpage>482</lpage>
          . doi:10.1007/978-981-19-7528-8_37.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Amrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oukid</surname>
          </string-name>
          , I. Gagaoua, T. Ensarİ,
          <article-title>Breast cancer classification using machine learning</article-title>
          , in: 2018 Electric Electronics, Computer Science, Biomedical Engineerings'
          <source>Meeting (EBBT)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          . doi:10.1109/EBBT.2018.8391453.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E. M. E. F.</given-names>
            <surname>Christian Dwi</surname>
          </string-name>
          <string-name>
            <surname>Suhendra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Effan</given-names>
            <surname>Najwaini</surname>
          </string-name>
          ,
          <article-title>A machine learning perspective on daisy and dandelion classification: Gaussian naive bayes with sobel</article-title>
          ,
          <source>Indonesian Journal of Data and Science</source>
          <volume>4</volume>
          (
          <year>2023</year>
          )
          <fpage>151</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Mu</surname>
          </string-name>
          ,
          <article-title>Implementation of music genre classifier using knn algorithm</article-title>
          ,
          <source>Highlights in Science Engineering and Technology</source>
          <volume>34</volume>
          (
          <year>2023</year>
          )
          <fpage>149</fpage>
          -
          <lpage>154</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Karnika</surname>
          </string-name>
          <string-name>
            <surname>Dwivedi</surname>
          </string-name>
          ,
          <article-title>Hari Om Sharan, Analysis of decision tree for diabetes prediction</article-title>
          ,
          <source>International Journal of Engineering and Technical Research (IJETR) 9</source>
          (
          <year>2019</year>
          )
          <fpage>3</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Juba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Precision-recall versus accuracy and the role of large data sets</article-title>
          ,
          in:
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>33</volume>
          ,
          <year>2019</year>
          , pp.
          <fpage>4039</fpage>
          -
          <lpage>4048</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>