<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Efficient Algorithm for the Prediction of Cancer of the Kidney Using Data Analytic Technique</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aranuwa Felix Ola</string-name>
          <email>felix.aranuwa@aaua.edu.ng</email>
          <email>ogundareolanike@yahoo.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Adekunle Ajasin University</institution>
          ,
          <addr-line>Akungba - Akoko, Ondo State</addr-line>
          ,
          <country country="NG">Nigeria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Malaysia University of Science and Technology</institution>
          ,
          <addr-line>Selangor</addr-line>
          ,
          <country country="MY">Malaysia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>7</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>This research presents an efficient algorithm for predicting cancer of the kidney, from which medical practitioners and patients can gain valuable knowledge for early and proactive intervention strategies to save lives from this harmful disease. To achieve this objective, datasets pertaining to patients with cancer of the kidney were acquired from selected private and public hospitals in South West Nigeria. A two-layered classifier system consisting of Rule Induction (RI) and Decision Tree (DT) classifiers was designed to build the model using a data analytic approach. The classifier system was tested using case study data from fifty-two (52) selected Local Governments in South West Nigeria, obtained by purposive and selective sampling. Ten classification algorithms were used in the modeling. The Waikato Environment for Knowledge Analysis (WEKA) was used for the experiment, and each model was built in two test modes (10-fold cross-validation and percentage split). The performance of the algorithms was compared using standard classification accuracy metrics and model-building time. The experimental results show that the J48 decision tree algorithm outperforms all other algorithms in all the layers, with correctly classified instances of 74.7%, F-Measure of 0.614, TP rate of 0.747, FP rate of 0.135, and precision and recall of 0.687 and 0.714 respectively. The best algorithm took 0.03 seconds to build the model. This indicates that the algorithm is suitable for the research purpose. The results from the system framework on test data show that the identified attributes, the algorithm and the system model performed well and can serve as a valuable tool for early detection of the disease in patients.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Analytics</kwd>
        <kwd>Classification Algorithms</kwd>
        <kwd>Data Mining</kwd>
        <kwd>Kidney Cancer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Data analytics has proven to be a multi-dimensional discipline that
uses descriptive techniques and predictive models to gain valuable
knowledge from data warehouses for recommendations and
decision making. It is the discovery of patterns and the
communication of meaningful insight in data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. According to
Berson, Smith and Thearling (1999), data analytics is the science
of examining raw data with the purpose of drawing conclusions
from it [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ]. It focuses on inference, identifying undiscovered patterns
and establishing hidden relationships [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Figure 1 depicts the process
of data analytics. The science is generally divided into exploratory
data analysis (EDA), where new features in the data are
discovered, and confirmatory data analysis (CDA), where existing
hypotheses are proven true or false. Typically, the term is used to
describe the technical aspects of data analysis,
especially predictive modeling and machine learning techniques. Data
analytics has commonly been applied to business data, marketing
mix modeling, web analysis, risk analysis and fraud analysis to
communicate insights from data. It is well suited to recommending
actions and guiding decision making.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. METHOD AND MATERIALS</title>
      <p>[Table 1: Data format for the research data collection. Columns: Variable Name, Variable Format, Variable Type. Recovered entries: Gender (Male, Female), Age, Lifestyle, G&amp;H Disorder, C&amp;I Exposure, Prediction Level.]</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Data Collection and Data Format</title>
      <p>The dataset for this research work was collected from
selected health centres and hospitals in the south western part of
Nigeria using purposive and selective sampling techniques. The
researcher collected a sample of 1,006 records from
fifty-two selected health centres in six (6) different states. The
data collected were cleaned, normalized and organized in a form
suitable for the data analytic process. Table 1 shows the data format
for the research data collection, while Figure 1 and Figure 2 show
the visualized information about the selected states and health centres
respectively.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Data Analysis &amp; Interpretation</title>
      <p>Of the 1,006 patient records captured, 44.8% were
male while the remaining 55.2% were female (see Table 2). The
analysis further revealed that 57.1% of the patients were exposed to
chemical and industrial contents, while 32.7% of the population
had a gender and hereditary disorder. The patients’ lifestyle data
also indicated that people in this region are
addicted to smoking and drinking alcohol, and make regular use of
non-steroidal anti-inflammatory drugs (NSAIDs) such
as ibuprofen and naproxen, which can increase the risk of the
disease by 51%. Other factors include obesity; faulty genes; a
family history of kidney cancer; having kidney disease that
needs dialysis; being infected with hepatitis C; and previous
treatment for testicular cancer or cervical cancer. There is
also an indication that high blood pressure is a possible risk factor,
though this is still under investigation.</p>
    </sec>
    <sec id="sec-5">
      <title>3. DESIGN OF EXPERIMENT AND RESULTS</title>
    </sec>
    <sec id="sec-7">
      <title>3.1 Research Experimental Platform</title>
      <p>
        The Waikato Environment for Knowledge Analysis (WEKA) platform
was used for the data analytic experiment. It is a powerful data
mining tool with a GUI Chooser from which any of the four
major WEKA application environments (Explorer, Experimenter,
KnowledgeFlow and Simple CLI) can be selected. The Explorer
application was selected for this experiment because its
workbench provides visualization tools, data
processing, attribute ranking and predictive modeling through a
graphical user interface (GUI), functionalities that are
important to the research work.
WEKA is a collection of machine learning algorithms for data
mining tasks. Algorithms implemented in WEKA include
Bayesian classifiers, Decision Trees, Rules, Artificial Neural
Networks (Functions), Lazy classifiers and miscellaneous
classifiers. For the purpose of this work, Rule Induction and
Decision Tree classifiers were considered. These families of
classifiers were selected because of their performance in
various domains: both have been successfully applied to a
variety of real-world classification tasks in industry, business,
science and education [10]. The classifier
system designed for the data modeling, shown in Figure 3, has
two layers. Layer 1 consists of JRip, PART and Decision Table from
the Rule Induction family, and Layer 2 consists of J48, LAD
Tree, Decision Stump, Random Forest, REP Tree, BF Tree and
LMT from the Decision Tree family. Decision Trees, also
known as “white box” classification models, can provide an
explanation for their models and can be used directly for
decision making [5], while Rule Induction is one of the
fundamental tools of data mining, in which formal rules are
extracted from a set of observations. The rules extracted represent
a full scientific model of the data [
        <xref ref-type="bibr" rid="ref5">6</xref>
        ]. According to Kapil et al.
(2013), rule induction is a popular and well-researched method for
discovering interesting relations between variables in large
databases. These capabilities make rule induction well suited
to an effective and efficient intelligent
system. A major paradigm of Rule Induction is
Association Rules [
        <xref ref-type="bibr" rid="ref6">7</xref>
        ].
      </p>
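      <p>As an illustration of how a decision tree learner of the kind described above chooses its splits, the following Python sketch computes the gain ratio used by C4.5, the algorithm that WEKA's J48 implements. It is a simplified sketch and not WEKA's code: the function names (entropy, gain_ratio) and the toy records are hypothetical and are not taken from the study's dataset.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(records, attribute, target):
    """Gain ratio of splitting `records` (a list of dicts) on `attribute`."""
    labels = [r[target] for r in records]
    total = len(records)
    # Group the class labels by the value each record takes on the attribute.
    groups = {}
    for r in records:
        groups.setdefault(r[attribute], []).append(r[target])
    # Information gain: entropy before the split minus weighted entropy after.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy([r[attribute] for r in records])
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy records, NOT taken from the study's dataset.
toy = [
    {"lifestyle": "Smoking", "exposure": "Yes", "pl": "One"},
    {"lifestyle": "Smoking", "exposure": "Yes", "pl": "One"},
    {"lifestyle": "Obesity", "exposure": "Yes", "pl": "Two"},
    {"lifestyle": "Obesity", "exposure": "No", "pl": "Two"},
]
for name in ("lifestyle", "exposure"):
    print(name, round(gain_ratio(toy, name, "pl"), 3))
```
]]></preformat>
      <p>In this toy example the lifestyle attribute separates the two prediction levels perfectly, so it receives the higher gain ratio and would be chosen as the first split. The full C4.5 procedure adds further steps, such as pruning, that are omitted here.</p>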
      <p>[Figure 3: Classifier system for the data modeling. Components: Patient’s Databank; Classifier System (Layer 1: Rule Induction; Layer 2: Decision Tree); Performance Evaluation; Optimal Algorithm; KC Prediction System.]</p>
      <sec id="sec-7-2">
        <title>KC Prediction System</title>
        <p>As shown in Figure 3, the patient’s databank component is
responsible for collecting, updating and storing patients’
data from different sources. The classifier system component is
responsible for the data modeling based on the algorithms in the
layers. The performance evaluation component evaluates
the algorithms in the layers using standard metrics
to identify the best (optimal)
algorithm. The rules generated by this algorithm are to be
incorporated into the prediction system. Since the objective of this
research work, to present a suitable algorithm for the kidney cancer
prediction system, has been achieved, the
processes of the prediction system are not discussed here;
they will be addressed in future work.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3.2 Experimental Results</title>
      <p>Ten (10) classification algorithms from the classifier families
considered in this work were used to model the patients’
dataset. The dataset for the experiment was first divided into
training and testing sets: 66% of the
records, selected at random, were devoted to training, while the remaining 34% were
used for testing. JRip, PART and
Decision Table in layer 1 of the classifier system were first used to
model the patients’ data, followed by the Decision Tree classifiers in layer 2.
Both the 10-fold cross-validation and percentage split test modes were
considered in the modeling. Since the algorithms come from different
classifier families, they yielded different models that classify
differently on some inputs. The algorithms were tested on the
dataset in order to determine which one best models the data
with the best predictive accuracy.</p>
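      <p>The two test modes described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reproduction of WEKA's procedure: the function names (percentage_split, k_fold_indices) and the seed are hypothetical, only record indices are handled, and WEKA's own shuffling, rounding and stratification may differ.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
import random


def percentage_split(n_records, train_fraction=0.66, seed=1):
    """Shuffle record indices and split them into training and test sets."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    cut = round(n_records * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_indices(n_records, k=10, seed=1):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


# 1,006 is the number of records reported in the data collection section.
train, test = percentage_split(1006)
print(len(train), len(test))

folds = list(k_fold_indices(1006))
print(len(folds), sorted(len(t) for _, t in folds))
```
]]></preformat>
      <p>In the percentage split mode each record is used either for training or for testing, whereas in 10-fold cross-validation every record is used for testing exactly once, which is one reason the latter is often preferred on datasets of this size.</p>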
      <p>
        The performance of the various algorithms in
layer 1 and layer 2 was compared based on the output from the percentage split
(hold-out) and 10-fold cross-validation modes.
The results of the models from the two modes and the
performance evaluations are presented in Table 3. The 10-fold
cross-validation test mode was preferred since it produced
the best model in both layer 1 and layer 2 of the classifier system.
Moreover, 10-fold cross-validation has been widely
used and is described as a better option for determining the
performance of a classifier [
        <xref ref-type="bibr" rid="ref7">8</xref>
        ]. Table 4 shows the standard accuracy metrics
from the 10-fold cross-validation mode
for all the algorithms in the experiment. Figure 4 and
Figure 5 show the graphs of predictive accuracy and time taken to
build the models by the classifiers respectively.
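[Residue of a table or figure at this position: index values 1 to 10 followed by the values 1, 0.714, 0.746, 0.731, 0.658, 0.749, 0.718, 0.704, 0.647, 0.643, 0.716; the metric they report is not identified.]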
      </p>
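      <p>For reference, the standard metrics named above can be derived from confusion-matrix counts as in the following Python sketch. The function name binary_metrics and the counts used are hypothetical and are not results from this study. The sketch treats a single class against the rest; for a multi-class problem such as the three prediction levels, a tool would typically report such figures per class together with an average.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def binary_metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives and false negatives for one class against the rest.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall is the same quantity as the TP rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "tp_rate": recall,
        "fp_rate": fp / (fp + tn),
        "precision": precision,
        "recall": recall,
        # F-measure: harmonic mean of precision and recall.
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical counts for illustration only, NOT the study's results.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```
]]></preformat>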
      <p>
        The experimental results and analysis show that the J48
decision tree and LMT outperform all other algorithms in
the layers. However, the J48 decision tree was chosen as the best
algorithm in this work because it achieved correctly classified
instances of 74.7%, a ROC Area of 0.78 and a recall of 0.714.
It also has a lower FP rate of 0.153, an F-Measure of 0.614,
and took less time (0.03 seconds) to build the model compared
to LMT and other classifiers, as shown in Table 4. Additionally,
the J48 decision tree algorithm can generally
produce a simple tree structure with high
classification accuracy, even with a huge volume of data [
        <xref ref-type="bibr" rid="ref8">9</xref>
        ]. Pruning
methods have been introduced to reduce the complexity of the tree
structure without any decrease in classification accuracy. The J48
decision tree structure and rules as generated by WEKA are
presented in Figure 6.
The rules generated from the best algorithm (J48 pruned decision
tree) are stated in Rules 1 to 20. The rules were tested in a
prediction system framework and their prediction levels (PL) are
classified as One, Two and Three. These show the
status of patients: Levels One and Two
indicate a risk level or status of disease manifestation in the
patient that needs to be attended to urgently, while Level Three
indicates that the patient is not manifesting any symptoms of
kidney cancer but may suffer from other diseases. A
back-end for updating the rules as the situation arises will be
incorporated into the system to match other conditions.
Rule 1: IF (G&amp;H Disorder = No) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = blood in urine): PL = One
Rule 2: IF (G&amp;H Disorder = No) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = back pain): PL = Two
      </p>
      <p>Rule 3: IF (G&amp;H Disorder = No) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = tumor): PL = Three
Rule 4: IF (G&amp;H Disorder = No) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = Fibroids): PL = Three
Rule 5: IF (G&amp;H Disorder = No) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = stomach ulcer): PL = Two
Rule 6: IF (G&amp;H Disorder = No) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = Kidney pain): PL = One
Rule 7: IF (G&amp;H Disorder = No) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = Abdominal pain): PL = Two
Rule 8: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = blood in urine): PL = One
Rule 9: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Obesity) AND (Complaints = blood in urine): PL = Two
Rule 10: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = Yes) AND (Lifestyle = HB Pressure) AND (Complaints = blood in urine): PL = Two
Rule 11: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = Drug Abuse OR Tumor OR Fibroids): PL = Two
Rule 12: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = Abdominal pain): PL = Two
Rule 13: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = Kidney pain): PL = One
Rule 14: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Smoking) AND (Complaints = stomach ulcer): PL = One
Rule 15: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Alcohol OR Dialysis) AND (Complaints = stomach ulcer): PL = Two
Rule 16: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Radiation) AND (Complaints = stomach ulcer OR blood in urine): PL = One
Rule 17: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = Yes) AND (Lifestyle = Water pills) AND (Complaints = stomach ulcer): PL = Three
Rule 18: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = No) AND (Lifestyle = Smoking) AND (Complaints = stomach ulcer OR kidney pain): PL = One
Rule 19: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = No) AND (Lifestyle = Smoking) AND (Complaints = stomach ulcer): PL = Two
Rule 20: IF (G&amp;H Disorder = Yes) AND (C&amp;I Exposure = No) AND (Lifestyle = Smoking OR Obesity OR Drug Abuse OR Radiation OR Water Pills OR Dialysis) AND (Complaints = stomach ulcer): PL = Three</p>
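      <p>To show how rules of this form might be applied in a prediction system, the following Python sketch encodes Rules 1, 2, 3, 8 and 9 as a lookup table. It is an illustrative sketch only, not the prediction system of this work: the names RULES and predict_level are hypothetical, the matching is case-insensitive by assumption, and returning None when no rule applies is a design choice made here for the example.</p>
      <preformat preformat-type="code" xml:space="preserve"><![CDATA[
```python
def _norm(value):
    """Normalise an attribute value so that matching ignores case and padding."""
    return value.strip().lower()


# (G&H Disorder, C&I Exposure, Lifestyle, Complaint) -> Prediction Level (PL).
# Only a subset of the twenty rules is encoded, for illustration.
RULES = {
    ("no", "yes", "smoking", "blood in urine"): "One",    # Rule 1
    ("no", "yes", "smoking", "back pain"): "Two",         # Rule 2
    ("no", "yes", "smoking", "tumor"): "Three",           # Rule 3
    ("yes", "yes", "smoking", "blood in urine"): "One",   # Rule 8
    ("yes", "yes", "obesity", "blood in urine"): "Two",   # Rule 9
}


def predict_level(gh_disorder, ci_exposure, lifestyle, complaint):
    """Return the prediction level for a patient, or None if no rule matches."""
    key = tuple(_norm(v) for v in (gh_disorder, ci_exposure, lifestyle, complaint))
    return RULES.get(key)


print(predict_level("NO", "Yes", "Smoking", "blood in urine"))
print(predict_level("YES", "Yes", "Obesity", "blood in urine"))
print(predict_level("No", "No", "Smoking", "back pain"))
```
]]></preformat>
      <p>Keeping the rules in a table rather than in nested conditionals would make it straightforward to update them as new conditions arise, in line with the rule-updating back-end mentioned in the text.</p>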
    </sec>
    <sec id="sec-9">
      <title>4. CONCLUSIONS</title>
      <p>This research work focused on presenting an efficient
algorithm suitable for predicting the status of kidney cancer in
patients. To achieve the objectives of the research work: (i) a
dataset pertaining to patients was acquired from fifty-two (52)
selected Local Government Area (LGA) health centres in the south western region of Nigeria
using purposive and selective sampling techniques; (ii) the
researcher developed a two-layered classifier system consisting of
Rule Induction and Decision Trees, implemented on the Waikato
Environment for Knowledge Analysis (WEKA), to build the data
model using a data analytic approach; and (iii) different machine
learning algorithms were used in search of the algorithm that
produced the model with the best predictive accuracy. In the
experiment, ten (10) classification algorithms from
different classifier families were applied to the
patients’ dataset. Since they are from different classifier families,
they yielded different models that classify differently on some
inputs. The performance of the various
algorithms in layer 1 and layer 2 was compared, and the standard metrics of
accuracy, precision, recall and F-measure for the best classifier
were computed, as shown in Table 3 and
Table 4 respectively. The results show that the J48 decision tree
outperforms all other algorithms in the layers, with
correctly classified instances of 74.7% in 0.03
seconds, ROC Area of 0.78, FP rate of 0.153, TP rate of 0.714,
and precision and recall of 0.614.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          20-
          <fpage>30</fpage>
          [1]
          <string-name>
            <surname>Lasebikan</surname>
            <given-names>OA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nwadinigwe</surname>
            <given-names>CU</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Onyegbule</surname>
            <given-names>EC</given-names>
          </string-name>
          <article-title>Pattern of bone tumours seen in a regional orthopaedic hospital in Nigeria.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kushi</surname>
            <given-names>LH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doyle</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McCullough</surname>
            <given-names>M</given-names>
          </string-name>
          , et al. (
          <year>2012</year>
          ).
          <article-title>"American Cancer Society Guidelines on nutrition and physical activity for cancer prevention: reducing</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Kohavi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rothleder</surname>
            ,
            <given-names>N. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Simoudis</surname>
            ,
            <given-names>A.P</given-names>
          </string-name>
          (
          <year>2002</year>
          )
          :
          <article-title>Emerging Trends in Business Analytics</article-title>
          . Published by ACM,
          <volume>45</volume>
          (
          <issue>8</issue>
          ), pages
          <fpage>45</fpage>
          -
          <lpage>48</lpage>
          , August 2002.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Berson</surname>
          </string-name>
          , Smith and Thearling (
          <year>1999</year>
          ). [5]
          <string-name>
            <surname>Romero</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Olmo</surname>
            ,
            <given-names>J. L</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Ventura</surname>
            ,
            <given-names>S</given-names>
          </string-name>
          (
          <year>2013</year>
          )
          <article-title>: A meta-learning approach for recommending a subset of white-box classification algorithms for Moodle datasets</article-title>
          . Department of Computer Science, University of Cordoba, Spain.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Grzymala-Busse</surname>
            ,
            <given-names>J. W</given-names>
          </string-name>
          (
          <year>2013</year>
          ). Rule Induction - University of Kansas. Extracted 20-06-2013.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Kapil</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheveta</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heena</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richa</surname>
            ,
            <given-names>D</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Jasreena</surname>
            ,
            <given-names>K. B.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>A Hybrid Approach Based On Association Rule Mining and Rule Induction in Data Mining</article-title>
          .
          <source>International Journal of Soft Computing and Engineering (IJSCE)</source>
          , ISSN: 2231-2307,
          <volume>3</volume>
          (
          <issue>1</issue>
          ), March 2013, 146.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <surname>WEKA</surname>
          </string-name>
          ,(
          <year>2011</year>
          )
          <article-title>: WEKA Tutorial</article-title>
          . The University of Waikato (
          <year>2011</year>
          ). Available at: http://www.cs.waikato.ac.nz/ml/weka/ (Accessed 20 July, 2013).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Mohamed</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Nor Haizan</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohd</surname>
            <given-names>N. S,</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Abdul H. O</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>A Comparative Study of Reduced Error Pruning Method in Decision Tree Algorithms</article-title>
          .
          <source>IEEE International Conference on Control System, Computing and Engineering</source>
          , 23-25 Nov. 2012, Penang, Malaysia.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>