<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Journal of Computer Science Issues (IJCSI)</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">0933-3657</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/s00521-021-05841-x</article-id>
      <title-group>
        <article-title>Classification of Obesity Types Using Random Forest and Decision Tree Algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martyna Kramarz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominik Dzida</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Applied Mathematics, Silesian University of Technology</institution>
          ,
          <addr-line>Kaszubska 23, 44100 Gliwice</addr-line>
          ,
          <country country="PL">POLAND</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <volume>23</volume>
      <issue>3</issue>
      <fpage>14611</fpage>
      <lpage>14626</lpage>
      <abstract>
        <p>In this paper, we focus on two algorithms, Decision Tree and Random Forest, and on the analysis of a dataset that describes the eating habits and physical condition of adults and adolescents. The dataset consists of 17 columns; the last one assigns each record to one of seven categories: Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Insufficient Weight, Obesity Type II, Obesity Type III. We researched which algorithms had been used on similar datasets and picked those that were most suitable. We then processed and modified the data to achieve the maximum accuracy. In the end, Random Forest reached an accuracy of 96.37 percent, compared with 94.32 percent for Decision Tree.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Data Analysis</kwd>
        <kwd>Obesity Prediction</kwd>
        <kwd>Feature Engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Obesity is an enormous problem, and the number of people affected by it is growing every
day [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In 2022, 2.5 billion adults aged 18 years and older were overweight, including over 890
million adults who were living with obesity [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Obesity can result in serious, potentially life-threatening health issues, including
hypertension, type II diabetes mellitus, an increased risk of coronary disease, increased
unexplained heart failure, hyperlipidemia, infertility, and a higher prevalence of colon, prostate,
endometrial, and breast cancer [
        <xref ref-type="bibr" rid="ref3">3, 4</xref>
        ]. Using algorithms to speed up the classification of medical problems can give very good
results, which has been shown in many areas: brain tumors [4, 6], neurodegenerative
disorders [5], malaria detection [7] and many more.
      </p>
      <p>Obesity has also been covered in the literature, which established that ML algorithms can
work well in the self-management of obesity, and that their integration into electronic tools such as
mobile devices and EHRs, for personal or population-based clinical decision-making, can be
explored by researchers towards the development of smart and impactful digital health
interventions [8,9]. We chose these specific algorithms for two reasons: both
have been tested on a similar dataset (one also related to determining the degree of
childhood obesity) [10], and we analyzed the inner workings of the algorithms.</p>
      <p>A Decision Tree is a supervised classification approach built from elements similar to those of an
ordinary tree structure: a root and nodes. A Decision Tree starts from the root, moves downward and
is generally drawn from left to right. The node from which the tree starts is called the root
node. The node where a chain ends is known as a “leaf” node. A node represents a certain
characteristic, while the branches represent ranges of values. These ranges of values act as
partition points for the set of values of the given characteristic [11].</p>
      <p>Random Forest is an ensemble learning method used for classification and regression tasks that
constructs multiple decision trees during training. It combines the predictions from these trees to
improve accuracy and control over-fitting. Each tree in the forest is built from a random
subset of the training data, and features are randomly selected for splitting at each node. This
randomness leads to a diverse set of models, whose collective output is more robust and
generalizes better to unseen data [12, 13].</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>In our work we used a dataset from the Kaggle site called "Obesity or CVD risk". The dataset
comprises estimates of obesity levels among individuals from Mexico, Peru, and Colombia, aged
between 14 and 61, with varied eating habits and physical conditions. The data was gathered
through a web-based survey where anonymous participants responded to each question. The
collected information was then processed, resulting in 17 attributes and 2111 records.</p>
      <sec id="sec-2-1">
        <title>2.1. Dataset column description</title>
        <p>Our dataset consists of 17 columns:
• Gender - text value "Male" or "Female"
• Age - numerical value from 14 to 61 (in years)
• Height - decimal value from 1.45 to 1.98 (in meters)
• Weight - decimal value from 39 to 173 (in kilograms)
• family_history_with_overweight - text value "yes" or "no" - whether a family member suffered or suffers from overweight
• FAVC - text value "yes" or "no" - frequent consumption of high-calorie food
• FCVC - numerical value from 1 to 3 - frequency of consumption of vegetables
• NCP - numerical value from 1 to 4 - number of main meals
• CAEC - text value "Always", "Frequently", "Sometimes" or "no" - consumption of food between meals
• SMOKE - text value "yes" or "no" - smoker or not
• CH2O - numerical value from 1 to 3 (in liters) - daily consumption of water
• SCC - text value "yes" or "no" - calorie consumption monitoring
• FAF - numerical value from 0 to 3 - physical activity frequency
• TUE - numerical value from 0 to 2 - time using technology devices
• CALC - text value "no", "Sometimes", "Frequently" or "Always" - consumption of alcohol
• MTRANS - text value "Public_Transportation", "Walking", "Automobile", "Motorbike" or "Bike" - transportation used
• NObeyesdad - text value "Normal_Weight", "Overweight_Level_I", "Overweight_Level_II", "Obesity_Type_I", "Insufficient_Weight", "Obesity_Type_II" or "Obesity_Type_III" - the determined obesity level</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data preparation</title>
        <p>To make working with the data easier, we decided to present all information in numerical
form. We converted the following columns: "NObeyesdad", "MTRANS", "CALC",
"SCC", "SMOKE", "CAEC", "FAVC", "family_history_with_overweight" and "Gender". We made
the conversion in accordance with the tables below.</p>
        <p>[Conversion table; surviving column "Numeric value": 3, 2, 1, 0]</p>
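        <p>The conversion of text columns to numbers can be sketched as a dictionary lookup per column. This is a minimal sketch only: the numeric codes in YES_NO and FREQUENCY, as well as the helper encode_row, are assumptions made for illustration and are not taken from the conversion tables of this paper.</p>
        <preformat>
```python
# Hypothetical mappings: the numeric codes below are assumptions for
# illustration only, not the codes from the paper's conversion tables.
YES_NO = {"no": 0, "yes": 1}
FREQUENCY = {"no": 0, "Sometimes": 1, "Frequently": 2, "Always": 3}

COLUMN_MAPS = {
    "FAVC": YES_NO,
    "SMOKE": YES_NO,
    "SCC": YES_NO,
    "family_history_with_overweight": YES_NO,
    "CAEC": FREQUENCY,
    "CALC": FREQUENCY,
}


def encode_row(row):
    """Return a copy of row with the mapped text columns replaced by numbers."""
    encoded = dict(row)
    for column, mapping in COLUMN_MAPS.items():
        if column in encoded:
            encoded[column] = mapping[encoded[column]]
    return encoded


example = {"FAVC": "yes", "CAEC": "Sometimes", "SMOKE": "no", "Age": 21}
print(encode_row(example))
```
        </preformat>
        <p>Columns without an entry in COLUMN_MAPS (here Age) pass through unchanged, and an unexpected text value raises a KeyError instead of being encoded silently.</p>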
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Data analysis</title>
        <p>After changing all values to numerical values, we checked that no column in
our set contained empty values. After confirming the correctness of all our data, we decided to
check how the data is distributed among the individual categories in each column of our dataset. Below are
charts describing this distribution:</p>
        <p>Figure 9: NObeyesdad distribution.</p>
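        <p>The two checks described in this section, looking for empty values and counting records per category, can be sketched with the standard library alone. The helper names columns_with_missing and class_distribution and the three sample rows are hypothetical; they only illustrate the idea.</p>
        <preformat>
```python
from collections import Counter


def columns_with_missing(rows):
    """Return the sorted names of columns that hold None or an empty string."""
    missing = set()
    for row in rows:
        for column, value in row.items():
            if value is None or value == "":
                missing.add(column)
    return sorted(missing)


def class_distribution(rows, column):
    """Count how many records fall into each category of the given column."""
    return Counter(row[column] for row in rows)


# Hypothetical, already encoded sample records.
rows = [
    {"Gender": 1, "NObeyesdad": 0},
    {"Gender": 0, "NObeyesdad": 2},
    {"Gender": 1, "NObeyesdad": 2},
]
print(columns_with_missing(rows))
print(class_distribution(rows, "NObeyesdad"))
```
        </preformat>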
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Algorithms</title>
      <sec id="sec-3-1">
        <title>3.1. Selection of algorithms</title>
        <p>In order to select an appropriate classifier for our set, we used the ready-made sklearn library
containing a significant number of classifiers. After comparing the accuracy results of many of
them, we chose Decision Tree and Random Forest due to their high accuracy.</p>
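        <p>A comparison of this kind could look roughly like the sketch below. It assumes scikit-learn is installed and uses synthetic data from make_classification as a stand-in for the encoded obesity dataset. The set of candidates (including KNeighborsClassifier) and the 5-fold cross-validation are assumptions, since the classifiers and the evaluation protocol actually compared are not listed here.</p>
        <preformat>
```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded data: 16 features, 7 classes.
X, y = make_classification(
    n_samples=600, n_features=16, n_informative=8,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

# Illustrative candidates only; the paper does not list what was compared.
candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "KNeighbors": KNeighborsClassifier(),
}

# Mean 5-fold cross-validated accuracy per classifier.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.3f}")
```
        </preformat>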
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Decision Tree</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Quick description of the Decision Tree algorithm</title>
          <p>The decision tree algorithm is a machine learning method used for classification and regression. It
involves the iterative division of a data set into subsets based on attribute values. The splitting process
continues until the nodes become homogeneous (contain elements of one class) or they cannot be
further divided in a meaningful way. Decision trees are easy to interpret, but can be prone to
overfitting, which can be mitigated by pruning the tree.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Customizable parameters of the Decision Tree algorithm</title>
          <p>Our Decision Tree implementation takes two parameters when the tree is created. The first
one is max_depth, which limits the maximum depth (number of levels) that the tree can
reach. This parameter lets us influence accuracy by preventing overfitting.
The second parameter is min_leaf, which sets the minimum number of records required to
create another node. This parameter lets us avoid a situation in which a
single value causes the prediction to be incorrect.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Implementation of the algorithm</title>
          <p>Our implementation contains the following functions: __init__(), train(), _create_tree(),
_find_label_prob(), _find_best_split(), _sets_entropy(), _entropy(), predict(), _predict_one_case().
Below are descriptions of these functions.</p>
          <p>• __init__()
– This function is run when the tree is created and sets the parameters and required
variables.
• train()
– This function is responsible for starting the tree training and running the _create_tree
function.
• _create_tree()
– This is the main function responsible for building the tree. It first checks
whether the maximum depth has been reached; if not, it uses the _find_best_split
function to find the best split of the available data. Then, using the TreeNode class,
the data obtained from the split, and the _find_label_prob function, it creates
a new node. For both subsets, the condition of the minimum number of records is
checked. If the condition is not met, the node is returned as a terminal node;
otherwise, the _create_tree function is called with increased depth on the
corresponding subsets for the right and left branches of the node.
• _find_label_prob()
– This function returns the probability of occurrence of each class in the set.
• _find_best_split()
– This function returns the two sets with the lowest combined entropy, as calculated by the
_sets_entropy function. It checks the entropy for three values in each
feature. It additionally returns the feature and the value for which the
minimum entropy was obtained.
• _sets_entropy()
– This function returns the sum of the entropies calculated by the _entropy function,
each multiplied by the share of the subset in the entire set.
• _entropy()
– This function returns the sum, over all classes, of the terms
<inline-formula><tex-math>-p \cdot \log_2(p)</tex-math></inline-formula>,
where p is the number of occurrences of the class in the set divided by the length of
the set.
• predict()
– For each row in the test set, this function calls the _predict_one_case function
and returns a list of all predicted classes.
• _predict_one_case()
– This function compares the value of a given feature with the value stored in the
node and, if possible, moves to the next node. If the next node does not exist, it
returns the probabilities from the node.</p>
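          <p>The entropy-based split search described above can be sketched as follows. This is not the implementation used in this paper: the function names mirror _entropy, _sets_entropy and _find_best_split only loosely, and the choice of the three candidate values per feature (the quartile positions of the sorted values) is an assumption, because the way these values are picked is not specified here.</p>
          <preformat>
```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy: the sum of -p * log2(p) over the classes in labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def sets_entropy(left, right):
    """Entropy of both subsets, each weighted by its share of all records."""
    total = len(left) + len(right)
    return sum(len(part) / total * entropy(part) for part in (left, right) if part)


def find_best_split(rows, labels):
    """Try three candidate thresholds per feature and keep the lowest entropy.

    Returns a tuple (weighted_entropy, feature_index, threshold).
    """
    best = None
    for feature in range(len(rows[0])):
        values = sorted(row[feature] for row in rows)
        # Assumption: the three candidates are the quartile positions.
        candidates = sorted({values[len(values) * q // 4] for q in (1, 2, 3)})
        for threshold in candidates:
            left = [lab for row, lab in zip(rows, labels) if threshold > row[feature]]
            right = [lab for row, lab in zip(rows, labels) if row[feature] >= threshold]
            score = sets_entropy(left, right)
            if best is None or best[0] > score:
                best = (score, feature, threshold)
    return best


# Hypothetical toy data: feature 0 separates the two classes perfectly.
rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 5.0], [4.0, 4.0]]
labels = ["a", "a", "b", "b"]
score, feature, threshold = find_best_split(rows, labels)
print(feature, threshold)
```
          </preformat>
          <p>On this toy data the search selects feature 0 with threshold 3.0, because that split leaves each subset with a single class and therefore zero entropy.</p>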
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Random Forest</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Quick description of the Random Forest algorithm</title>
          <p>Random forest is a machine learning algorithm that consists of multiple decision trees built on
random subsets of the training data. Each tree is trained independently, and the final prediction is
the result of a vote among the trees. By randomly sampling data and randomly selecting features, random
forest is resistant to overfitting and is suitable for many machine learning applications.</p>
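          <p>The voting step can be illustrated with a short sketch. It assumes that each tree outputs a single class label and that the most frequent label wins; the helper majority_vote and the five tree outputs are hypothetical.</p>
          <preformat>
```python
from collections import Counter


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of five trees for a single record.
tree_outputs = [
    "Obesity_Type_I",
    "Overweight_Level_II",
    "Obesity_Type_I",
    "Obesity_Type_I",
    "Normal_Weight",
]
print(majority_vote(tree_outputs))  # Obesity_Type_I
```
          </preformat>
          <p>If the trees return class probabilities instead of labels, averaging the probabilities per class and taking the largest average would be a natural alternative.</p>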
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Customizable parameters of the Random Forest algorithm</title>
          <p>Our Random Forest implementation takes four parameters when the forest is created. The
first two, max_depth and min_leaf, are the Decision Tree parameters. The two remaining parameters are
tree_numb and sample_size. The first one sets the number of trees
created for the forest. The second one changes the amount of data that is passed to each
tree.</p>
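          <p>How tree_numb and sample_size might interact is sketched below. It rests on an assumption that is not confirmed here: sample_size is treated as a fraction of the training set, and records are drawn with replacement, so values above 1 are possible. The helper draw_samples and its seed parameter are hypothetical.</p>
          <preformat>
```python
import random


def draw_samples(rows, tree_numb, sample_size, seed=0):
    """Build one random sample of the training rows per tree.

    Assumption: sample_size is a fraction of the training set and rows are
    drawn with replacement, so a sample may be larger than the original set.
    """
    rng = random.Random(seed)
    count = max(1, round(sample_size * len(rows)))
    return [rng.choices(rows, k=count) for _ in range(tree_numb)]


rows = list(range(10))  # stand-in for ten training records
samples = draw_samples(rows, tree_numb=5, sample_size=1.7)
print(len(samples), len(samples[0]))  # 5 17
```
          </preformat>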
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Implementation of the algorithm</title>
          <p>Our implementation contains the following functions: __init__(), train(), predict(). Below are
descriptions of these functions.</p>
          <p>• __init__()</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Adjusting the algorithm parameters</title>
      <p>Changing the parameters of decision-making algorithms may have a significant impact on their
effectiveness, which is why we conducted tests to find the best parameters.</p>
      <sec id="sec-4-1">
        <title>4.1. Decision Tree</title>
        <p>The Decision Tree has only two parameters, max_depth and min_leaf. Below is a table of
accuracy when changing these parameters.</p>
        <p>The table above shows the accuracy in percent for each parameter setting. The horizontal
axis shows the min_leaf parameter and the vertical axis max_depth. The best results are
obtained for max_depth greater than 10 and min_leaf equal to 2. The obtained accuracy is
94.32%. For comparison, the classifier from the sklearn library achieved 90.69%, so our
result is better by almost 4 percentage points.</p>
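        <p>A parameter sweep of this kind can be sketched as an exhaustive search over all (max_depth, min_leaf) pairs. The helper grid_search, the grid values and the scoring function toy_accuracy are hypothetical placeholders; in practice evaluate would train a tree with the given parameters and return its accuracy on held-out data.</p>
        <preformat>
```python
from itertools import product


def grid_search(evaluate, max_depths, min_leaves):
    """Score every (max_depth, min_leaf) pair; return the table and best pair."""
    table = {
        (depth, leaf): evaluate(depth, leaf)
        for depth, leaf in product(max_depths, min_leaves)
    }
    best = max(table, key=table.get)
    return table, best


def toy_accuracy(max_depth, min_leaf):
    # Placeholder score, NOT the paper's results: rewards deeper trees up to
    # depth 12 and penalises larger min_leaf values.
    return min(max_depth, 12) / 12 - 0.01 * min_leaf


table, best = grid_search(
    toy_accuracy, max_depths=[4, 6, 8, 10, 12], min_leaves=[2, 4, 6]
)
print(best)  # (12, 2)
```
        </preformat>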
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Random Forest</title>
        <p>Since the Random Forest has four parameters, i.e. max_depth, min_leaf, tree_numb and
sample_size, we decided to conduct research only for the four best results obtained by the Decision
Tree. The parameters for the following tables are, respectively, (12, 2), (12, 4), (10, 2) and (10, 4), where the first
number is max_depth and the second is min_leaf.</p>
        <p>The table above shows the accuracy in percent for each parameter setting. The
horizontal axis shows the min_leaf parameter and the vertical axis max_depth. The best results are
obtained for max_depth equal to 12, min_leaf equal to 2, tree_numb equal to 5 and
sample_size equal to 1.7. The obtained accuracy is 96.37%.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <sec id="sec-5-1">
        <title>5.1. Remove one column</title>
        <p>Below we present an experiment testing how the accuracy changes when one of the columns is
removed. The correlation matrix shows the relationships that exist between individual variables. The
graph shows that if the FAVC column were removed, an accuracy of 94.795% would be
achieved.</p>
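        <p>The experiment can be sketched as a loop that removes one column at a time and re-evaluates the model on the remaining ones. The helper drop_one_column and the scoring function toy_evaluate are hypothetical; toy_evaluate returns made-up scores and stands in for training and testing a classifier on the kept columns.</p>
        <preformat>
```python
def drop_one_column(evaluate, columns):
    """Return {removed_column: accuracy} for each single-column removal."""
    results = {}
    for removed in columns:
        kept = [name for name in columns if name != removed]
        results[removed] = evaluate(kept)
    return results


def toy_evaluate(kept_columns):
    # Placeholder score, NOT the paper's measurement: pretends that Weight
    # and Height carry most of the signal.
    score = 0.5
    score += 0.3 if "Weight" in kept_columns else 0.0
    score += 0.15 if "Height" in kept_columns else 0.0
    return score


columns = ["Gender", "Age", "Height", "Weight", "FAVC"]
results = drop_one_column(toy_evaluate, columns)
for name, accuracy in results.items():
    print(f"without {name}: {accuracy:.2f}")
```
        </preformat>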
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Summary</title>
      <p>In conclusion, the assumption that the algorithms that worked well on datasets related to
childhood obesity would also give high accuracy on our data proved to be correct.
The accuracies obtained are 96.37% for Random Forest and 94.32% for Decision Tree, both of which are better
than the results of the algorithms provided by the sklearn library. We also conclude that using them might be
beneficial in medical diagnosis and in the self-management of obesity.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Jiang</surname>
            <given-names>Shu-Zhong</given-names>
          </string-name>
          , Wen Lu,
          <string-name>
            <surname>Zong</surname>
            <given-names>Xue-Feng</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruan</surname>
            <given-names>Hong-Yun</given-names>
          </string-name>
          , Yi Liu, “Obesity and hypertension”,
          <source>Experimental and Therapeutic Medicine</source>
          Volume
          <volume>12</volume>
          Issue 4, October
          <year>2016</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Bloomgarden</surname>
            <given-names>ZT</given-names>
          </string-name>
          , “
          <article-title>Third Annual World Congress on the Insulin Resistance Syndrome: associated conditions</article-title>
          ”,
          <source>Diabetes Care</source>
          ,
          <year>2006</year>
          Sep;
          <volume>29</volume>
          (
          <issue>9</issue>
          ):
          <fpage>2165</fpage>
          -
          <lpage>74</lpage>
          , doi:
          <pub-id pub-id-type="doi">10.2337/dc06-zb09</pub-id>
          . PMID:
          <pub-id pub-id-type="pmid">16936171</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Strumpf</surname>
            <given-names>E</given-names>
          </string-name>
          :
          <article-title>The obesity epidemic in the United States: causes and extent, risks and solutions</article-title>
          . The Commonwealth Fund; New York, NY:
          <year>2004</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>