Breast Cancer Prediction by Machine Learning Algorithms - A
Comparative Study of Naive Bayes, KNN and J48 in Weka
Environment
Trishit Banerjee a, Geetha Ganesan b
a Netaji Subhash Engineering College, Techno City, Garia, Kolkata - 700152, India
b Advanced Computing Research Society, Chennai, India


                 Abstract
                 Breast cancer is considered one of the most common cancers occurring in women. Each year
                 it affects 2.1 million women and causes a large share of cancer-related deaths; it is estimated
                 that breast cancer took the lives of 627,000 women in 2018 alone. The disease is mostly
                 observed in the developed areas of the world, with rates rising in almost every region.
                 Cancer prediction is a crucial application for the data mining tools currently available.
                 K-nearest neighbor, the J48 algorithm, and Naïve Bayes are applied here to predict the
                 disease. Naïve Bayes handles extensive datasets well and is easy to build. K-nearest neighbor
                 separates the dataset into distinct categories and classifies new points. The J48 classifier,
                 grounded on the decision tree, uses information from the training dataset to decide how to
                 split it into smaller subsets. The Weka tool is applied to measure accuracy on the cancer
                 dataset, which encompasses nine kinds of cancer. A 70% train and 30% test split of the data
                 has been utilized to predict the disease. The accuracy of Naïve Bayes is 91.81%, whereas the
                 accuracies of J48 and K-nearest neighbor are 92.98% and 97.07%, respectively.

                 Keywords
                 Breast cancer, K-nearest neighbor, J48 algorithm, Naïve Bayes

1. Introduction
1.1.      Background
    Early diagnosis is essential to improve the outcomes and survival rates of breast cancer. There are two
strategies for early detection of breast cancer: screening and early diagnosis. Settings with limited
resources and weak health systems, where most women are diagnosed in advanced stages, must focus on
early detection programs grounded in awareness of early symptoms and prompt referral for diagnosis and
treatment. Screening comprises assessing women to identify cancer risk before the appearance of
symptoms. Screening tools include breast self-exam, clinical breast exam, and mammography. Since
screening requires significant investment, the decision to pursue it should follow the establishment of
fundamental breast health services that include efficient detection and appropriate treatment. Early
diagnosis involves delivering timely access to cancer treatment by reducing the barriers to, and improving
access to, proper diagnostic services. The purpose of this study is to compare the accuracy of the Naïve
Bayes, KNN, and J48 classifiers in identifying breast cancer in its early phases [5, 12]. Cancer is a genetic
disorder that arises from alterations to genes and the sudden, uncontrolled growth and division of cells.
Metastatic cancer occurs when cancer cells spread to another place

CVMLH-2022: Workshop on Computer Vision and Machine Learning for Healthcare, April 22 – 24, 2022, Chennai, India.
EMAIL: gitaskumar@yahoo.com (Geetha Ganesan)
ORCID: 0000-0001-7338-973X (Geetha Ganesan)
            ©️ 2022 Copyright for this paper by its authors.
            Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
            CEUR Workshop Proceedings (CEUR-WS.org)



within the body from the location where it developed. This spread of cells from one part to another is
known as metastasis.

1.2.    Motivation
   Data mining techniques are spreading extremely quickly in the medical field because of their success
in classification and prediction tasks that aid experts in decision-making. We are searching for ways to
improve patients' health and decrease the cost of medicine, and data mining assists greatly in this. A few
papers are available on predicting the disorder, and the most crucial part is that this prediction is merged
with predicting the existence of the specific cancer disease. Many types of cancer are still poorly
understood, and doctors are sometimes unable to find the reason behind the disease. Early detection is
needed to cure the disorder, so undertaking prediction research is critical. Here we utilize the open-source
data mining tool Weka, which was created at the University of Waikato, New Zealand. It assists us in
predicting cancer precisely and aids proper decision-making. We have used the well-known classification
procedures K-Nearest Neighbor, J48, and Naïve Bayes. A dataset of patients diagnosed with breast cancer
is used to demonstrate a suitable solution.

1.3.    Paper Organization
   The present paper has seven sections, beginning with this Introduction and followed by the Related
Work, Problem Statement, Data and Classifier Details, Methodology, Results and Analysis, and
Conclusion sections.

2. Related Work
   Scientists are striving hard to reduce the consequences of malignancy, and there are multiple open
questions regarding predicting the survivability of cancer. Thousands of people die from breast cancer,
one of the most frequent cancers. It occurs owing to gene damage that accumulates with increasing age;
age activates a combination of factors generating mutations, which in turn develop into tumors. Early
diagnosis is therefore necessary, and designing a genetic mutation-based strategy to predict cancer has
become essential.

    The clustering procedure and diverse classification procedures were utilized by Dona Sara Jacob
et al. [6]; the results show that the classification algorithms were superior to the clustering procedure.
Studies grounded on the Support Vector Machine evaluated all the procedures, indicating that Naïve
Bayes was faster than the SVM model; other machine learning techniques considered include K-nearest
neighbor and the decision tree. K-means clustering was among the well-known data mining processes
utilized. Research shows that classification procedures predict breast cancer better than clustering
procedures, and the most acceptable breast cancer prediction algorithm is decided by the exactness of the
procedure.

    Alom et al. [1] used the Inception Recurrent Residual Convolutional Neural Network to classify
medical images of breast cancer. The specified neural network exhibited higher performance than
comparable Inception Networks, RCNNs, and Residual Networks. Satapathi et al. [10] provided
prediction modeling based on genetically transformed cells, whose abrupt abnormalities produce
carcinogenic cells in the human body. Saritas & Yasar [9] studied the classification performance of
Naïve Bayes classifiers for data containing nine inputs and one output, comparing it with an artificial
neural network (ANN). They worked on ANN and Naïve Bayes classification to identify the preeminent
scope for breast cancer detection using anthropometric data and standard blood test results. The work of
Kumar et al. [7] focused on various forms of data mining technology to detect malignant and benign
breast cancer. The UCI dataset was used, with clump thickness as an assessment measure of breast
cancer. The performances of twelve algorithms, including J48, Lazy IBk, Logistic Regression, and Naïve
Bayes, were evaluated.

   According to Rashmi et al. [8], data mining refers to the method of retrieving details from an
enormous dataset. To extract helpful information, the data ought to be appropriately arranged. The
method is utilized to explore a considerable quantity of data to find consistent patterns. The paper
delivers an assessment of prediction and classification techniques. Breast cancer accounts for 12% of
new cancer cases and is noted to be the second-most common cancer globally. It is essential to detect the
tumor type if the disease is diagnosed in the early stages. Through a breast tissue biopsy, pathologists can
view the microscopic structures and components of the breast tissue histologically. These histological
photographs allow the pathologist to differentiate between normal tissue, non-malignant (benign) tissue,
and malignant lesions.

3. Problem Statement
   Breast cancer is the most prevalent form of cancer; a painless, hard lump in the breast is its most
prominent symptom. Like most tumors, breast cancer is best treated when diagnosed early. Early
detection of breast cancer raises the number of viable treatment choices and dramatically increases the
probability of clinical success and recovery. Mammography reports about 70% of tumors, but between
4% and 34% of breast carcinoma cases cannot be diagnosed by it. A number of studies on artificial
intelligence, machine learning, and data processing for identifying breast cancer were reviewed. Danacı
et al. [3] used pattern recognition to detect breast cancer cells using the C4.5 algorithm in the Waikato
Environment for Knowledge Analysis (Weka) tool.

   The present study aimed to gain comparative insight into decision-making by classifying breast
cancer data with Naïve Bayes, K-nearest neighbor, and J48 decision trees. In this study, the breast cancer
dataset, contributed by Syeda Daraqshan, was taken from Kaggle and examined using the Weka tool [4].
The effectiveness of the J48, K-nearest neighbor, and Naïve Bayes data mining algorithms has been
compared.

4. Data and Classifier Details
4.1.    Data Details
   The cancer patients’ data were collected from the Kaggle website; the dataset was created on
28/08/2020 [4]. There are 569 instances containing 30 attributes and a single class attribute, supplied in
.csv format; the attributes describe the tumor, and the class records the stage of breast cancer
(benign/malignant). The dataset was divided into train and test sets with a 70-30 split in Weka. The 30
attributes fall into mean, standard error (se), and worst categories for each of ten measured features
(radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and
fractal dimension).
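
   The split described above was performed in the Weka workbench. As a minimal sketch of how the
same loading and 70-30 split could be reproduced with the Weka Java API (the file name
breast_cancer.csv and the position of the class column are assumptions):

    import java.io.File;
    import java.util.Random;
    import weka.core.Instances;
    import weka.core.converters.CSVLoader;

    public class LoadAndSplit {
        public static void main(String[] args) throws Exception {
            // Load the breast cancer CSV file (file name is an assumption).
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("breast_cancer.csv"));
            Instances data = loader.getDataSet();

            // The benign/malignant class attribute is assumed to be the last column.
            data.setClassIndex(data.numAttributes() - 1);

            // Shuffle, then split 70% train / 30% test (569 instances -> 398 / 171).
            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.7);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

            System.out.println("Train: " + train.numInstances() + ", Test: " + test.numInstances());
        }
    }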

4.2.    Classifier Details
   The most frequently used facility in Weka is classification, which assists in decision-making.
Classification can be divided into two types: binary and multi-class. The former comprises two kinds of
outcomes, whereas the latter delivers more than two values. The key objective is to categorize and
forecast accurately. Forecasting relies on associated data that can inform us about future happenings:
prediction develops a connection between the data already known and the data to be forecast. Among the
several types of classifiers, three are utilized in this paper: K-Nearest Neighbor, the J48 algorithm, and
Naïve Bayes. We used Weka version 3.9 on Windows 10.

   Naïve Bayes is a straightforward technique belonging to the family of "probabilistic classifiers" [11].
It encompasses a family of procedures sharing the principle that each feature is evaluated independently
of the other presented values; it operates under the assumption that the influence of a specific attribute
value does not depend on the other attribute values. This particular algorithm is very helpful for
acquiring a big dataset and extremely easy to develop.
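
   As an illustrative sketch (not the paper's exact configuration), Weka's NaiveBayes class can be trained
and scored on the held-out split prepared above; train and test are assumed to be already loaded Instances
objects:

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;

    public class NaiveBayesSketch {
        // Returns the percentage of correctly classified test instances.
        static double evaluateNaiveBayes(Instances train, Instances test) throws Exception {
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(train);        // estimates per-class, per-attribute distributions
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(nb, test);     // scores the held-out 30% split
            return eval.pctCorrect();
        }
    }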
    K-Nearest Neighbor (KNN) is a non-parametric and lazy procedure that takes a dataset, separates it
into distinct categories, and forecasts the class of new points [2]. KNN, also known as instance-based
learning, memorizes the training observations to categorize unseen test data; it compares the test
observations with the training observations.
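
   In Weka, KNN is implemented by the lazy IBk classifier. A minimal sketch follows, with train
assumed to be a prepared Instances object and k a free parameter (Section 6 reports results for k = 1 and
k = 4):

    import weka.classifiers.lazy.IBk;
    import weka.core.Instances;

    public class KnnSketch {
        static IBk buildKnn(Instances train, int k) throws Exception {
            IBk knn = new IBk();
            knn.setKNN(k);                // number of neighbours consulted for each test point
            knn.buildClassifier(train);   // lazy learner: essentially memorizes the training data
            return knn;
        }
    }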

   J48 is a non-proprietary Java implementation of the C4.5 algorithm. Given a dataset comprising a
dependent variable and several independent variables, applying a decision tree such as J48 enables us to
forecast new records. It works well on continuous and discrete data, can handle missing values, and even
provides the option of pruning trees after generation. The classifier uses a greedy process to reduce error.
Each test condition outcome is displayed as a branch, and a class label is given to every leaf node.
Typically, the root node is the largest node of the decision tree, and each path in the tree corresponds to a
conjunction of attribute conditions. The algorithm typically proceeds in two steps using a depth-first
method: the tree is built top-down, and the data are split repeatedly until the elements at a node belong to
the same class.
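
   The J48 options shown below are illustrative assumptions (Weka's defaults), not the paper's reported
settings; they show how pruning after tree generation can be controlled:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class J48Sketch {
        static J48 buildTree(Instances train) throws Exception {
            J48 tree = new J48();
            tree.setUnpruned(false);           // keep post-pruning of the generated tree enabled
            tree.setConfidenceFactor(0.25f);   // pruning confidence (Weka default)
            tree.setMinNumObj(2);              // minimum number of instances per leaf (Weka default)
            tree.buildClassifier(train);       // grows the tree top-down, depth first
            return tree;
        }
    }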

5. Methodology
   The following figure presents the process flow of the classification algorithms with the training
dataset. A biopsy has to be performed if the attribute characteristics indicate malignancy; otherwise,
there is no requirement for a biopsy in the case of benign characteristics. The data mining process
involves two stages: first, the classification models were trained with the 70% training split; finally,
predictions on the test dataset were recorded.
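
   As a sketch of the prediction step in Figure 1 (the trained classifier clf and the 30% test split are
assumed inputs), each held-out instance is assigned a benign or malignant label:

    import weka.classifiers.Classifier;
    import weka.core.Instances;

    public class PredictSketch {
        static void predictAll(Classifier clf, Instances test) throws Exception {
            for (int i = 0; i < test.numInstances(); i++) {
                double label = clf.classifyInstance(test.instance(i));    // index of predicted class
                String name = test.classAttribute().value((int) label);   // e.g. "B" or "M"
                System.out.println("Instance " + i + " predicted as " + name);
            }
        }
    }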




Figure 1: Classification flow of Prediction of Breast Cancer

6. Results and Analysis
    In this study, three classification algorithms have been compared on their accuracy in correctly
predicting instances in Weka version 3.9. The best algorithm has been chosen based on prediction
accuracy. The benign and malignant classes have been analyzed based on error rate, accuracy, precision,
F-score, sensitivity, and specificity; sensitivity indicates the correctly identified positive cases, whereas
specificity indicates the correctly identified negative cases. A cost-benefit analysis was carried out on the
confusion matrices for all three cases. The confusion matrices were interpreted in terms of TP, FP, TN,
and FN (T = True, P = Positive, N = Negative, F = False).
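
   All of the quantities reported below (accuracy, RMSE, RAE, per-class TP/FP rates, precision, recall,
F-measure, MCC, ROC and PRC areas, and the confusion matrix) are exposed by Weka's Evaluation
class. A minimal sketch, with the classifier and the train/test splits assumed as inputs:

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.core.Instances;

    public class MetricsSketch {
        static void report(Classifier clf, Instances train, Instances test) throws Exception {
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(clf, test);

            System.out.println("Accuracy (%): " + eval.pctCorrect());
            System.out.println("RMSE: " + eval.rootMeanSquaredError());
            System.out.println("RAE (%): " + eval.relativeAbsoluteError());

            for (int c = 0; c < test.numClasses(); c++) {   // per-class metrics, as in Tables 1-3
                System.out.println(test.classAttribute().value(c)
                    + " TP=" + eval.truePositiveRate(c)
                    + " FP=" + eval.falsePositiveRate(c)
                    + " Precision=" + eval.precision(c)
                    + " Recall=" + eval.recall(c)
                    + " F=" + eval.fMeasure(c)
                    + " MCC=" + eval.matthewsCorrelationCoefficient(c)
                    + " ROC=" + eval.areaUnderROC(c)
                    + " PRC=" + eval.areaUnderPRC(c));
            }
            System.out.println(eval.toMatrixString("Confusion matrix"));
        }
    }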




6.1.    Data Exploration
  Exploration of the 569 instances revealed no missing values for any attribute. There were 212
malignant and 357 benign cases.
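
   A sketch of how this exploration could be reproduced programmatically (data is the assumed, already
loaded Instances object from Section 4.1):

    import weka.core.AttributeStats;
    import weka.core.Instances;

    public class ExploreSketch {
        static void explore(Instances data) {
            // Report any attribute with missing values (none were found here).
            for (int a = 0; a < data.numAttributes(); a++) {
                AttributeStats stats = data.attributeStats(a);
                if (stats.missingCount > 0) {
                    System.out.println(data.attribute(a).name() + ": " + stats.missingCount + " missing");
                }
            }
            // Class distribution: expected 212 malignant and 357 benign.
            int[] counts = data.attributeStats(data.classIndex()).nominalCounts;
            for (int c = 0; c < counts.length; c++) {
                System.out.println(data.classAttribute().value(c) + ": " + counts[c]);
            }
        }
    }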

6.2.    Performance Exploration of the Classifiers
   Naïve Bayes classifiers are non-deterministic in nature and can handle large, noisy data. The
algorithm is generally faster than the KNN classifier and is based on the hypothesis that all the factors
contribute independently to the classification. In the present scenario, the algorithm correctly predicted
91.81% of instances (n = 157) from the test dataset of 171 instances. The precision of prediction was
higher for benign cases than for malignant tumors. The confusion matrix revealed that 59 of 65
malignant instances and 98 of 106 benign instances were predicted correctly. The following table
provides the detailed accuracy measures for the two classes. The root mean square error (RMSE) was
0.29, and the relative absolute error (RAE) was 17.49%.

Table 1: Detailed Accuracy of Naïve Bayes Classifier on Breast Cancer Data
    Class   TP-Rate   FP-Rate   Precision   Recall   F-Measure   MCC    ROC-Area   PRC-Area
      M      0.91      0.08       0.88       0.91      0.89      0.83     0.97       0.93
      B      0.93      0.09       0.94       0.93      0.93      0.83     0.97       0.99

    K-Nearest Neighbor (Lazy IBk) is a non-deterministic classifier that is generally slower on huge
datasets and does not handle noisy data well. At N = 1, the KNN algorithm correctly predicted 95.91% of
instances from the test dataset; the classifier produced its optimum of 97.07% correctly classified
instances at N = 4. The precision of prediction was higher for benign cases than for malignant tumors.
The confusion matrix revealed that 63 of 65 malignant instances and 103 of 106 benign instances were
predicted correctly. The following table provides the detailed accuracy measures for the two classes. The
RMSE was 0.16, and the RAE was 10.10%.

Table 2: Detailed Accuracy of KNN (N = 4) Classifier on Breast Cancer Data
    Class   TP-Rate   FP-Rate   Precision   Recall   F-Measure   MCC    ROC-Area   PRC-Area
      M      0.97      0.03       0.96       0.97      0.96      0.94     0.98       0.98
      B      0.97      0.03       0.98       0.97      0.98      0.94     0.98       0.98
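
   As a sketch of how the choice N = 4 could be arrived at, IBk can be evaluated over a range of k values
and the most accurate one kept (train and test are the assumed 70/30 splits; the upper bound maxK is
arbitrary):

    import weka.classifiers.Evaluation;
    import weka.classifiers.lazy.IBk;
    import weka.core.Instances;

    public class KnnTuningSketch {
        static int bestK(Instances train, Instances test, int maxK) throws Exception {
            int best = 1;
            double bestAcc = 0.0;
            for (int k = 1; k <= maxK; k++) {
                IBk knn = new IBk();
                knn.setKNN(k);
                knn.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(knn, test);
                if (eval.pctCorrect() > bestAcc) {   // keep the k with the highest test accuracy
                    bestAcc = eval.pctCorrect();
                    best = k;
                }
            }
            return best;
        }
    }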

    J48 Decision Tree is a deterministic classifier that can deal with large, noisy datasets with high
accuracy. J48 builds a tree consisting of a root node, internal nodes, and leaf nodes, following a
depth-first or breadth-first approach. In the present study, the classifier produced an optimum prediction
of 92.98% correctly classified instances. The precision of prediction was higher for benign cases than for
malignant tumors. The confusion matrix revealed that 58 of 65 malignant instances and 101 of 106
benign instances were predicted correctly. The following table provides the detailed accuracy measures
for the two classes. The RMSE was 0.26, and the RAE was 17.72%.

Table 3: Detailed Accuracy of J48 Classifier on Breast Cancer Data
    Class   TP-Rate   FP-Rate   Precision   Recall   F-Measure   MCC    ROC-Area   PRC-Area
      M      0.89      0.05       0.92       0.89      0.91      0.85     0.91       0.87
      B      0.95      0.11       0.94       0.95      0.94      0.85     0.91       0.91




7. Conclusion
    This paper is chiefly grounded on a medical dataset from which cancer prevalence can be forecast
from the symptoms. Three different classification procedures are used to validate the dataset, and the
Weka 3.9 tool was utilized for this purpose. The application consists of data exploration and
classification. The dataset does not raise any privacy concerns, as it does not contain personal details; it
comprises only medical information. From the evaluation, we can state that KNN performed the most
accurate classification for N = 4, correctly classifying almost 97% of instances. Confusion matrices were
obtained for the three distinct algorithms; these contain details of the actual and predicted classifications.
There are two classes of this disorder: malignant and benign. For a clear view, tables were created
comparing the three classification procedures, namely K-Nearest Neighbor, Naïve Bayes, and J48. The
comparison lucidly indicates which classification model is better. All three algorithms perform well;
however, in this study, K-Nearest Neighbor worked better than the other two procedures, the J48
algorithm and Naïve Bayes. In the future, we will strive to improve the technologies and equipment,
expand the datasets, and implement additional preparation, clustering, classification, visualization, and
regression techniques. In addition, we will work on advancing new models’ survivability prediction and
predictive capabilities.

8. References
[1] Alom, M., Yakopcic, C., Nasrin, M., Taha, T., Asari, V.: Breast Cancer Classification from
    Histopathological Images with Inception Recurrent Residual Convolutional Neural Network.
    Journal of Digital Imaging. 32, 605-617 (2019).
[2] Bharati, S., Rahman, M., Podder, P.: Breast cancer prediction applying different classification
    algorithm with comparative analysis using WEKA. 4th International Conference on Electrical
    Engineering and Information & Communication Technology (iCEEiCT). IEEE (2018).
[3] Danacı, M., Çelik, M., Akkaya, A.: Prediction and diagnosis of breast cancer cells using data
    mining methods. ASYU’2010, pp. 9-12, Kayseri, Turkey (2010).
[4] Daraqshan, S.: Kaggle: Your Machine Learning and Data Science Community.
    https://www.kaggle.com/syedadaraqshan/breast-cancer-prediciton-using-machine-learning.
[5] Dubey, A., Gupta, U., Jain, S.: Comparative Study of K-means and Fuzzy C-means
    Algorithms on The Breast Cancer Data. International Journal on Advanced Science,
    Engineering and Information Technology. 8, 18 (2018).
[6] Jacob, D., Viswan, R., Manju, V., PadmaSuresh, L., Raj, S.: A Survey on Breast Cancer
    Prediction Using Data Mining Techniques. In 2018 Conference on Emerging Devices and Smart
    Systems (ICEDSS). pp. 256-258. IEEE (2018).
[7] Kumar, V., Mishra, B., Mazzara, M., Thanh, D., Verma, A.: Prediction of Malignant and Benign
    Breast Cancer: A Data Mining Approach in Healthcare Applications. Advances in Data Science
    and Management. pp. 435-442. Springer, Singapore (2020).
[8] Rashmi, G., Lekha, A., Bawane, N.: Analysis of Efficiency of Classification and Prediction
    Algorithms (Naïve Bayes) for Breast Cancer Dataset. 2015 International Conference on
    Emerging Research in Electronics, Computer Science, and Technology (ICERECT). pp. 108-
    113. IEEE (2015).

[9] Saritas, M., Yasar, A.: Performance Analysis of ANN and Naive Bayes Classification Algorithm
     for Data Classification. International Journal of Intelligent Systems and Applications in
     Engineering. 7, 88-91 (2019).
[10] Satapathi, G., Srihari, P., Jyothi, A., Lavanya, S.: Prediction of cancer cells using DSP
     techniques. International Conference on Communication and Signal Processing. pp. 149-153.
     IEEE (2013).
[11] Xu, S.: Bayesian Naïve Bayes classifiers to text classification. Journal of Information Science.
     44, 48-59 (2018).
[12] Verma, D., Mishra, N.: Analysis and prediction of breast cancer and diabetes disease
    datasets using data mining classification techniques. In 2017 International Conference on
    Intelligent Sustainable Systems (ICISS). pp. 533-538. IEEE (2017).



