<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Leveraging Pharmacy Medical Records To Predict Diabetes Using A Random Forest &amp; Artificial Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stephen Lavery</string-name>
          <email>laverys@tcd.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jeremy Debattista</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National College of Ireland</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Trinity College Dublin</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Diabetes is a disease that affects millions of people around the world. As early diagnosis of the disease is critical, predictive models have been designed in order to classify undiagnosed patients. This study proposes the use of pharmacy medical records, which have previously remained an untapped resource, for tackling this problem. As the volume of pharmacy data is greater than that of clinical datasets, models developed using these records could be deployed on a larger scale and affect a wider number of people. To undertake this research, patient conditions were derived from the pharmacy medical records of 15,812 patients. These conditions, as well as the patients' attributes, were then run through a feature selection process to identify the important features for predicting diabetes. These variables were then used to train a random forest and an artificial neural network, achieving an accuracy of 77.1% and 76.4% respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Diabetes</kwd>
        <kwd>Pharmacy Medical Records</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Diabetes is an epidemic disease affecting millions of people throughout the world.
The World Health Organisation estimates that there are currently more than
240 million people living with the disease, and by 2030 it will be the seventh
leading cause of death [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Diabetes can be split into two main categories: type
I and type II diabetes. Type I diabetes occurs when the body fails to produce
enough insulin, and type II diabetes occurs when the body becomes insulin
resistant. Insulin is an important hormone, produced by the pancreas, that
signals the body's cells to absorb glucose from the bloodstream. When cells fail
to properly absorb glucose, glucose levels in the bloodstream rise. This can result
in major health complications such as coronary artery disease, heart attack,
stroke, nerve damage and blindness [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. As such, early diagnosis is vital in order to better manage
the disease and improve patient outcomes. This study specifically
addresses type II diabetes.
      </p>
      <p>
        Advances in machine learning have allowed researchers to develop a variety of
predictive models for classifying diabetes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The limitation of these models
is that they use clinical healthcare datasets, which are not easy to capture as they
require patients to visit a clinical setting. This visit is generally made on
the advice of a healthcare professional. This presents a two-fold problem:
persons being tested have already been identified as at-risk, and they might already
have significant health consequences if they have been living with undiagnosed
diabetes. Testing patients in a clinical setting also requires manual processing of
test results. This is an expensive process, as specialised equipment and healthcare
professionals are needed in order to determine a diagnosis. Pharmacy medical
records are a passively generated data source, gathered whenever a
patient visits their local pharmacy. They offer a readily available set of patient
features that can be used for ongoing diabetes screening. This could greatly
reduce the cost of preliminary screening and could lead to earlier diagnosis.
      </p>
      <p>In this study, a machine learning model for predicting diabetes is proposed
using pharmacy patient medical records. For this work, the core research question
is defined:</p>
    </sec>
    <sec id="sec-2">
      <title>To what extent can pharmacy patient medical records be used to predict type II diabetes?</title>
      <p>There are several challenges in answering this research question. These
include: gathering a sufficient number of patient records, deriving suitable training
features from the pharmacy dataset and devising a strategy to reduce the
complexity of the predictive model. In order to address these issues, and better
answer the research question, the following objectives were devised: (i) Identify
the most important patient features in the dataset for predicting diabetes, (ii)
Construct a set of models using the features identified in objective (i) and (iii)
Evaluate the models, and compare them against the current state-of-the-art
research.</p>
      <p>Section two of this study discusses the related works and state-of-the-art
models. Section three covers the methodology and implementation
of the approach, and section four evaluates the results of this
study and discusses their implications.</p>
      <sec id="sec-2-1">
        <title>Related Works</title>
        <p>
          Advances in computer science have allowed for the crossover of medical
expertise and machine learning for classifying diabetes. Models such as artificial neural
networks, support vector machines, random forests, Bayesian networks and
logistic regression have been used to tackle this problem [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. In the following section,
random forests and artificial neural networks are identified as having the most
accurate predictive ability for classifying diabetes. Neural networks work well
when dealing with complex medical datasets where relationships between
conditions are not always understood. Random forests offer similar advantages, and
are also suitable where training datasets are limited in sample size, which
is often the case when dealing with clinical healthcare datasets.
        </p>
        <p>
          Artificial Neural Networks.
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] have demonstrated the success of a backpropagation neural network in their
approach for predicting diabetes. The paper proposes an 8-10-1 network (one hidden layer
with 10 nodes), which employs the Levenberg-Marquardt algorithm. This
method is an iterative process, which finds a local minimum of the model's error to
best fit the data. Similarly, [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] apply a backpropagation approach for calculating
the gradient of the cost function in a small-world feedforward neural network for
classifying diabetes. Unlike the Levenberg-Marquardt algorithm demonstrated
by [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a bipolar sigmoid transfer function is used to activate the neurons. The
sigmoid function optimises the model, making it a computationally less intensive
approach. In contrast, [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] configure their diabetes predictive neural network
with 2 hidden layers, as opposed to 1, with 8 input nodes. In both instances,
the researchers demonstrate how the proposed models achieve a high degree of
accuracy, of 93% and 81% respectively. As shown by [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], neural networks have
outperformed Bayesian networks for classifying diabetes patients. The researchers
note that a naive Bayes approach assumes independence between
features, making it computationally less expensive than artificial neural networks.
Although this reduces execution time, the researchers conclude that this
assumption may reduce accuracy, making the neural network a more favourable
approach for predicting diabetes. A similar benchmarking exercise
conducted by [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] also shows how an artificial neural network outperforms a
logistic regression model for predicting diabetes. The authors configure a 3-layered
perceptron neural network with 2 hidden layers and achieve an accuracy of 86%
versus 78% for the logistic regression model. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] also apply a
neural network to classify diabetes and use a genetically optimised algorithm for
maximising the accuracy of the model. The genetic algorithm works to find the
optimum weights and biases for the nodes by applying a fitness function to
minimise the mean squared error between the predicted and actual classification of
diabetes status during the training phase. This approach, coupled with feature
selection, results in a classification accuracy of 96%. A similar approach is taken
in research conducted by [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], who use a convolutional neural network (CNN)
for diabetes prediction. The CNN is a feedforward neural network, which has
multiple hidden layers, unlike the models configured by [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The authors
illustrate the advantages of the CNN, which allows for meaningful insights into
the relationship between different features, without any pre-processing.
This omits the need for feature selection, such as that conducted by [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], thus
reducing implementation time of the model.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Random Forest</title>
      <p>
        Researchers have demonstrated the use of random forests in the prediction of
diabetes with positive results when compared with other models. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] illustrate
this through their random forest model, which returned better recall, precision
and specificity when benchmarked against naive Bayes and logistic regression. In
this random forest model, an accuracy of 93% is achieved. Classification rates
in the study of [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] also show a high level of accuracy, at 89%. Similarly to [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
the researchers show how random forests outperform logistic regression as well
as support vector machines. The study also highlights the random forest's ability
to learn without any underlying assumptions of feature importance. In contrast
to [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] use an ensemble method combining classification and regression trees
(CART) and random forest. CART is used to find the maximum average purity
in splitting the nodes of the decision tree. The authors demonstrate how using
this combined method overcomes accuracy problems encountered in other
studies. Model optimisation was performed on 600 training datasets and 100 test
datasets. As more trees are introduced, the researchers show how the classifier
error rate decreases, with a maximum accuracy of 84% achieved. In contrast,
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] use a CART and random forest model in isolation and compare the results
of the two models. The researchers found that the random forest outperforms
the CART approach, with an accuracy of 75% versus 65%. Random
forests have also been compared against adaptive boosting and iterative
dichotomiser 3 algorithms. A study by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] demonstrates this by benchmarking a random
forest model for predicting diabetes against these algorithms. The researchers
achieve a better prediction accuracy for the random forest across a number of
different model configurations. The researchers conclude that because the random
forest is more sensitive to early warning signs of diabetes, performance increases
significantly as the volume of training data is increased when compared against
the other models. The study finds a maximum classification accuracy of 84%, in
contrast to the next best model, the adaptive boosting model, which achieves
an accuracy of 82%.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Related Works Findings</title>
      <p>
        In the literature, the use of pharmacy data has been underexplored as a means
of classifying diabetes. This gap in the research is assessed in this study in
order to determine how well pharmacy data can be used to tackle such problems.
Many of the models in the literature also predefine the training features for
diabetes classification. In contrast, this study implements a feature selection
process, rather than eliminating certain variables from the beginning. This may
uncover new knowledge about diabetes predictive features. The set of models
used in this study applies the best-in-class machine learning methods identified
in the literature: a random forest and an artificial neural network. In
this study, the random forest takes a similar approach to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the artificial
neural network is configured with a logistic activation function like [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], as this
has proven to achieve the best classification rate for diabetes diagnosis. As with
the [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] model, the number of trees chosen for the random forest is increased
until no further significant accuracy is gained. For the artificial neural network,
a node pruning approach is also applied in order to find the optimum model
configuration.
      </p>
      <sec id="sec-4-1">
        <title>Methodology &amp; Implementation</title>
        <p>
          The approach taken in this study to address the research question was through
the use of two machine learning models: a random forest and an artificial neural
network. In order to test the proposed dataset, patient medical records were
obtained from 42 pharmacies across Ireland. In total, 15,812 records were
processed and transformed into the correct format for training the models. Once
the models had been trained, a graphical user interface was created, where new
data could be tested.
        </p>
        <p>
          This research was conducted using the R open source programming
language, through the RStudio integrated development environment [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The
extract, transform, load process, model training and model application were all
developed in R. The following sections describe the implementation in
detail and outline how the random forest and artificial neural network models were
configured.
        </p>
        <p>
          In order to derive the most important features for the training models, the
fscaret package was used [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Fscaret is designed for automatic feature selection
through the use of a variety of models. By comparing the results of multiple
models, a smoothing effect is achieved for finding the important features of a
classification problem. Each variable is scaled according to its mean-square error
(MSE) and root-mean-square error (RMSE) for comparison. The models
chosen for this feature selection implementation were: gradient boosting machine
(gbm), generalised linear model (glm), neural network (neuralnet), random forest
(rf) and treebagging. Once the models were trained, the results were presented
in a matrix listing the feature importance. Each variable was ranked based on its
weighting across each of the models according to its MSE and RMSE. In order
to decide which variables would be kept, and which would be omitted, a
variable importance plot was produced and a cut-off point for the model features
was defined.
        </p>
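        <p>The ranking step described above can be sketched as follows. This is a minimal illustration in Python rather than the R implementation used in the study, and it does not reproduce the fscaret scaling by MSE and RMSE. It assumes only that each model reports a positive importance score per feature, converts the scores to percentages within each model, sums them across models and keeps the highest-ranked features. The function name rank_features, the feature names and all scores are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def rank_features(importance_by_model, top_k):
    """Sum each feature's percentage contribution across models; keep the top_k."""
    totals = {}
    for scores in importance_by_model.values():
        # Assumption: every model reports a positive total importance.
        model_total = sum(scores.values())
        for feature, score in scores.items():
            # Express the raw score as a percentage of this model's total.
            share = 100.0 * score / model_total
            totals[feature] = totals.get(feature, 0.0) + share
    # Highest summed contribution first.
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical importance scores for two models; not values from the study.
example_scores = {
    "gbm": {"age": 6.0, "sex": 1.0, "laxatives": 3.0},
    "rf": {"age": 5.0, "sex": 3.0, "laxatives": 2.0},
}
selected = rank_features(example_scores, top_k=2)
for feature, total in selected:
    print(feature, round(total, 1))
```
]]></preformat>
        <p>In practice the cut-off (top_k here) would be read from a variable importance plot, as in the study, rather than fixed in advance.</p>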
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Random Forest</title>
      <p>
        The dataset used for the random forest consisted of 15,812 patient records,
evenly split between diabetic and non-diabetic patients. The data was
then partitioned into a 70/30 split for the training and testing phases. Using
the randomForest package [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], the model was implemented and a series of
decision trees were produced. In this ensemble method, each tree produces a
slightly different result. These results are then combined and the mode decision
of all the trees is taken. By doing this, the issue of overfitting by a single
decision tree is addressed. The variables used in the random forest training model
were those that were defined during the feature selection stage. Using the
bootstrap aggregation function, multiple random samplings of patient records were
taken from the training dataset. In total, 30 bootstrap replications were taken to
produce 30 individual decision trees. Beyond this point there was no significant
performance gain from adding additional trees, only an increased computational
cost. Once the model had been computed, the predict function was used on the
testing dataset to classify each of the patient records as diabetic or non-diabetic.
These results were then compared against the actual diabetes status for each
patient. In order to better understand the model, the decision trees produced by
the random forest were plotted using the partykit package [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
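      <p>The two ingredients of bootstrap aggregation described above, resampling with replacement and taking the mode of the trees' decisions, can be sketched as follows. This is an illustrative Python sketch, not the randomForest implementation used in the study. The three "trees" are hypothetical one-rule stand-ins for fitted decision trees, and the field names and thresholds are assumptions made for the example only.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) records with replacement (one bootstrap replication)."""
    return [rng.choice(records) for _ in records]


def majority_vote(predictions):
    """Return the mode of the per-tree predictions (ties go to the label seen first)."""
    return Counter(predictions).most_common(1)[0][0]


def bagged_predict(trees, record):
    """Classify one record by combining the decisions of all trees."""
    return majority_vote([tree(record) for tree in trees])


rng = random.Random(42)
records = list(range(100))  # stand-in identifiers for training records
# 30 replications, matching the number of trees reported in the study.
samples = [bootstrap_sample(records, rng) for _ in range(30)]

# Hypothetical stand-ins for fitted trees; a real tree would be trained on one sample.
trees = [
    lambda r: "diabetic" if r["age"] > 58 else "non-diabetic",
    lambda r: "diabetic" if r["hypertension_meds"] else "non-diabetic",
    lambda r: "diabetic" if r["laxatives"] else "non-diabetic",
]
patient = {"age": 63, "hypertension_meds": True, "laxatives": False}
label = bagged_predict(trees, patient)
print(label)
```
]]></preformat>
      <p>Because each tree sees a different resample, individual trees may disagree; the vote is what reduces the overfitting of any single tree.</p>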
    </sec>
    <sec id="sec-6">
      <title>Arti cial Neural Network</title>
      <p>
        The same training dataset was taken for the artificial neural network and was
also partitioned into a 70/30 split for training and testing. The neuralnet package
developed by [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] was used to implement the model. A backpropagation algorithm
was used to retroactively recalculate the weights of each of the nodes. A variety of
configurations were tested to optimise the number of hidden layers and the nodes
within each layer. The models tested were all configured with 9 input nodes, as
the number of training features selected remained the same for each iteration.
Models with one hidden layer and two hidden layers were tested, with no increase
in accuracy being achieved by the two-layer configuration. For this reason, the
one hidden layer model was selected to improve computational efficiency. The
number of nodes within the hidden layer was also adjusted to find the optimal
network performance. A series of tests and node pruning found that 4 nodes
within the hidden layer achieved the highest accuracy. This resulted in a
9-4-1 (one hidden layer with four nodes) neural network. The activation function,
which is used to convert the input signal of each node into an output signal, was
set to a logistic function for smoothing the results and the threshold for the
error of the partial derivatives was set to 0.01, as this configuration achieved
the greatest accuracy in the related works. The error of the neural network was
calculated by the sum of squared errors. Once the model had been trained, its
performance was measured using the test dataset. The model computed the
diabetes status for each patient record, and the results were compared with the
actual diabetes status for each patient to calculate the precision, recall, F1 score
and accuracy of the model. As with the random forest, the neural network was
plotted to visualise the model configuration.
      </p>
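      <p>To make the 9-4-1 configuration concrete, the forward pass of such a network with a logistic activation, together with a sum of squared errors, can be sketched as follows. This is an illustrative Python sketch and not the neuralnet implementation used in the study: the weights are random placeholders rather than trained values, the training step (backpropagation) is omitted, and the input vector, the 0.5 decision threshold and all function names are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def logistic(x):
    """Logistic activation: maps any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(features, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass through one hidden layer and a single output neuron."""
    hidden = [
        logistic(sum(w * x for w, x in zip(weights, features)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return logistic(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


def sum_squared_error(predictions, targets):
    """Sum of squared differences between predicted and actual values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


rng = random.Random(0)
n_inputs, n_hidden = 9, 4  # the 9-4-1 layout described above
# Randomly initialised placeholder weights; training would adjust these.
hidden_weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_biases = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_weights = [rng.uniform(-1, 1) for _ in range(n_hidden)]
output_bias = rng.uniform(-1, 1)

# Hypothetical patient: nine features assumed to be scaled to the range 0..1.
patient = [0.63, 1, 1, 0, 1, 1, 0, 0, 1]
probability = forward(patient, hidden_weights, hidden_biases, output_weights, output_bias)
label = "diabetic" if probability >= 0.5 else "non-diabetic"
print(round(probability, 3), label)
```
]]></preformat>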
      <sec id="sec-6-1">
        <title>Evaluation</title>
        <p>In this section the outcome of the feature selection process is discussed, alongside
an evaluation of the prevalence of these features in the current literature. The
results of the random forest and artificial neural network are also presented and
the models are tested against a number of performance metrics. These findings
are then further explored and compared in the context of similar studies in the
discussion section.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Feature Selection</title>
      <p>In total, 84 features were available in the dataset during the feature selection
process. The percentage that each feature contributed to the chosen
models was calculated and summed across the models. The cut-off point was
chosen at the first 9 features, as subsequent features made only a
minor contribution to the model. The features selected were: age, sex, diuretic
medications (used to treat high blood pressure), intestinal secretion medications
(used to help digest fats and regulate cholesterol), lipid-regulated drugs (used to
treat cardiovascular diseases and alleviate high cholesterol), hypertension
medications (used to treat high blood pressure), rectal disorder medication (used
to treat rectal disorders), positive inotropic drugs (used to treat heart failure by
increasing the strength of cardiovascular contractions) and laxatives (used to
treat or prevent constipation).</p>
    </sec>
    <sec id="sec-8">
      <title>Random Forest</title>
      <p>The random forest was trained using the variables identified in the feature
selection process. In total, 30 decision trees were configured as part of the
bootstrap aggregation process. The results of the random forest model are presented
in Table 1 and Table 2. These show an overall accuracy for the model of 77.1%.
The recall and the false positive rate were also plotted against each other to
produce the receiver operating characteristic (ROC) curve. The
ROC curve shows an area under the curve of 0.803 (Figure 2) for the random
forest.</p>
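      <p>The metrics reported in this evaluation can be computed from predicted and actual labels as sketched below. This is an illustrative Python sketch, not the code used in the study; the four labels and scores are hypothetical and do not come from the study's tables. The area under the ROC curve is obtained by sweeping the decision threshold over the predicted scores and applying the trapezoidal rule to the resulting (false positive rate, recall) points.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def classification_metrics(actual, predicted):
    """Accuracy, precision, recall, false positive rate and F1 for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }


def roc_auc(actual, scores):
    """Area under the ROC curve via a threshold sweep and the trapezoidal rule."""
    positives = sum(actual)
    negatives = len(actual) - positives
    ranked = sorted(zip(scores, actual), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        is_last = index == len(ranked) - 1
        # Emit a point only when the score changes, so tied scores share a threshold.
        if is_last or ranked[index + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = diabetic) and scores; not data from the study.
actual = [1, 1, 0, 0]
predicted = [1, 1, 0, 1]
scores = [0.9, 0.4, 0.6, 0.1]
metrics = classification_metrics(actual, predicted)
auc = roc_auc(actual, scores)
print(metrics, auc)
```
]]></preformat>
      <p>The sketch assumes that both classes are present and that at least one positive prediction is made; otherwise the ratios are undefined.</p>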
    </sec>
    <sec id="sec-9">
      <title>Arti cial Neural Network</title>
      <p>The input features used in the neural network were also those identified during
the feature selection process. A number of different parameters were tested for
the neural network to find the optimal configuration. The number of hidden
layers, and nodes within the hidden layer, were adjusted until the most accurate
model of 1 hidden layer with 4 nodes was found. A 9-4-1 configuration for the
model was defined.</p>
      <p>The neural network is split up into three layers. The input layer is
where the features for the patients are fed into the model. Each feature is
represented by an individual neuron in the input layer. The second layer is the
hidden layer, which was configured with 4 neurons. The final layer is the output
layer, where the diabetes classification is predicted for the patients. The yellow
neurons represent the values of the weighted connections between the neurons,
which were initially randomised before converging at their final values. Each
neuron in the input layer has a connection to the hidden layer with a corresponding
weight. For each hidden neuron, the values of its connected input neurons are
multiplied by their connection weights and summed, together with a bias value.
This weighted sum is then put into an activation function, which transforms the value. The
transformed values then propagate through the network to produce a diabetes
classification at the output layer. As the network is trained, the weights are
iteratively adjusted through backpropagation to find the best fit for the model.
The results of the neural network are illustrated in Table 3 and Table 4.</p>
      <p>The aim of this study was to see to what extent a pharmacy's patient medical
records could be used to aid the prediction of type II diabetes. By leveraging
the capabilities of machine learning, the results have demonstrated a successful
approach for a random forest and an artificial neural network. The neural network
achieved a classification accuracy of 76.4%, with the random forest achieving a
slightly greater accuracy of 77.1%. These results present a novel means of
diabetes classification, not previously identified in the current literature. Although
the accuracy of the random forest and neural network did not outperform the
state-of-the-art models (Figure 5), these results still demonstrate how pharmacy data
can give sufficient accuracy. The results obtained in this study are nonetheless
around 16% less accurate than the best state-of-the-art models. This is because
clinical data has features such as family history and body mass index, which are
important indicators in such diseases.</p>
      <p>The features identified in this study further validate the known diabetes risk
factors identified in the related works chapter. The use of
lipid-regulated drugs and intestinal secretion drugs was identified as an important
feature in the random forest and artificial neural network. These drugs are
commonly used for treating high cholesterol, which is a major risk factor for
diabetes. Similarly, hypertension and diuretic medications, which are used to treat
high blood pressure, were also identified as major indicators for diabetes by
the models. This coincides with the known medical literature that persons with
elevated blood pressure are at greater risk of developing diabetes. Age was also
an important feature in both the random forest and artificial neural network in
this study. Age is a commonly understood variable linked to diabetes risk, with
an increase in age corresponding to an increased risk of developing the disease.
The data used to train the models in this study showed a sharp increase in
the number of patients with diabetes above the age of 58, with a significantly lower
number of instances of diabetes in younger patients.</p>
      <p>
        Although the major variables identified in the models were consistent with
the existing literature, an unexpected feature was identified. The presence of
laxatives was not identified as a predictive feature in any of the other models
reviewed in the related works chapter. However, studies by [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] and [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] have
shown a link between diabetes and constipation. A consequence of diabetes is
a high blood glucose level, which can ultimately lead to nerve damage of the
digestive tract that can result in constipation. Laxatives are commonly used to
treat constipation, thus potentially making them a useful feature for predicting
diabetes, as demonstrated in this study. Evidence from existing research, such
as that by [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], also draws a relationship between constipation and high fat
diets. These high fat diets can lead to high cholesterol, which further strengthens
the finding that there may be a link between constipation and diabetes, as high
cholesterol is an already known risk factor for the disease. The relationship
between diabetes and constipation is relatively unexplored in the current literature
and further research is needed in order to validate the prevalence of any such
relationship.
      </p>
      <p>This study suggests using the random forest and neural network models
developed as a preliminary means of identifying diabetes patients, and not as an outright
diagnosis tool. Pharmacists should identify patients classified by the models as
at risk of diabetes and invite them in for a fasting blood glucose level test to
confirm whether or not they have diabetes, or whether they fall within the pre-diabetes
range. The results of these tests could then be introduced back into the training
dataset to further improve the classification accuracy of the models and reduce
the number of false positive diabetes classifications. Coupled with this data, the
models could be refined even further to produce a three-way classification for
identifying non-diabetic, pre-diabetic and diabetic patients.</p>
      <sec id="sec-9-1">
        <title>Conclusion</title>
        <p>The primary goal of this study was to answer the research question: "To what
extent can pharmacy patient medical records be used to predict type II diabetes?"
This study found that a random forest and an artificial neural network could be
trained with pharmacy medical records to achieve an accuracy of 77.1% and
76.4% respectively. Although these results represent sufficient accuracy, they
did not outperform the state-of-the-art models, which use clinical healthcare
datasets. Although there was a loss of accuracy of around 16% versus the
state-of-the-art models, this loss is offset by the wide availability of
pharmacy medical records. This opens up a greater opportunity for the set of
models in this study to be applied to a wider population outside of clinical healthcare
settings. These models could be used as a preliminary diagnosis tool, and could
identify patients who need further assessment. This could greatly reduce the cost
and time taken to identify at-risk patients, as classifications could be run on a
larger scale. As diabetes is a progressive disease, early diagnosis is critical for
the long-term wellbeing of patients. These models could aid earlier diagnosis,
which could ultimately lead to better patient outcomes.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Future Works</title>
      <p>This study has demonstrated the potential of pharmacy data, and its application
for predicting diabetes. Further research is needed in this area to see what other
conditions can be predicted through machine learning methods using pharmacy
medical records. It is also suggested that pharmacy data could be used in
conjunction with clinical healthcare datasets to improve existing diabetes prediction
models. More research is also needed to better understand the important risk
factors of diabetic patients, and in particular the prevalence of laxatives and
their relationship to diabetes, as identified in this study.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alic</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurbeta</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badnjevic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Machine learning techniques for classification of diabetes and cardiovascular diseases</article-title>
          .
          <source>In: 2017 6th Mediterranean Conference on Embedded Computing</source>
          (<year>2017</year>). https://doi.org/10.1109/meco.2017.7977152
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Detection and Prediction of Diabetes Mellitus Using Back-Propagation Neural Network</article-title>
          (
          <year>2016</year>
          ). https://doi.org/10.1109/icmete.2016.11
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Sabariah</surname>
            ,
            <given-names>M. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanifa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sa</surname>
          </string-name>
          , S.:
          <article-title>Early detection of type II Diabetes Mellitus with random forest and classification and regression tree (CART)</article-title>
          (
          <year>2014</year>
          ). https://doi.org/10.1109/icaicta.2014.7005947
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Rallapalli</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suryakanthi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Predicting the risk of diabetes in big data electronic health records by using scalable random forest classification algorithm</article-title>
          . In:
          <source>2016 International Conference on Advances in Computing and Communication Engineering (ICACCE)</source>
          (
          <year>2016</year>
          ). https://doi.org/10.1109/icacce.2016.8073762
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Risk prediction of type II diabetes based on random forest model</article-title>
          (
          <year>2017</year>
          ). https://doi.org/10.1109/aeeicb.2017.7972337
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Manajan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohit</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Diagnosis of diabetes mellitus using PCA and genetically optimized neural network</article-title>
          . In:
          <source>2017 International Conference on Computing, Communication and Automation (ICCCA)</source>
          (
          <year>2017</year>
          ). https://doi.org/10.1109/ccaa.2017.8229838
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barnes</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>HCNN: Heterogeneous Convolutional Neural Networks for Comorbid Risk Prediction with Electronic Health Records</article-title>
          . In:
          <source>2017 IEEE/ACM International Conference on Connected Health: Applications, Systems and Engineering Technologies (CHASE)</source>
          (
          <year>2017</year>
          ). https://doi.org/10.1109/chase.2017.80
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <source>neuralnet: Training of Neural Networks</source>
          , https://CRAN.R-project.org/package=neuralnet. Last accessed 16 June 2018
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <source>Global Report on Diabetes</source>
          , https://bit.ly/1N8hn84. Last accessed 10 June 2018
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Adavi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salehi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roudbari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Artificial neural networks versus bivariate logistic regression in prediction diagnosis of patients with hypertension and diabetes</article-title>
          .
          <source>Medical Journal of The Islamic Republic of Iran</source>
          <volume>30</volume>
          (
          <issue>312</issue>
          ), (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Alghamdi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Mallah</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keteyian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brawner</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ehrman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sakr</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Predicting diabetes mellitus using SMOTE and ensemble machine learning approach: The Henry Ford Exercise Testing (FIT) project</article-title>
          .
          <source>PLOS ONE</source>
          <volume>12</volume>
          (
          <issue>7</issue>
          ), (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Erkaymaz</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ozer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perc</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Performance of small-world feedforward neural networks for the diagnosis of diabetes</article-title>
          .
          <source>Applied Mathematics and Computation</source>
          <volume>311</volume>
          ,
          <fpage>22</fpage>
          -
          <lpage>28</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Lopez</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torrent-Fontbona</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manuel Fernandez-Real</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Single Nucleotide Polymorphism relevance learning with Random Forests for Type 2 diabetes risk prediction</article-title>
          .
          <source>Artificial Intelligence in Medicine</source>
          <volume>85</volume>
          ,
          <fpage>43</fpage>
          -
          <lpage>49</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Hothorn</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeileis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Partykit: A Modular Toolkit for Recursive Partytioning</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>16</volume>
          (
          <issue>1</issue>
          ),
          <fpage>3905</fpage>
          -
          <lpage>3909</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Hanna</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guthrie</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Health-Compromising Behaviour and Diabetes Mismanagement Among Adolescents and Young Adults With Diabetes</article-title>
          .
          <source>The Diabetes Educator</source>
          <volume>7</volume>
          (
          <issue>22</issue>
          ),
          <fpage>223</fpage>
          -
          <lpage>230</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Taba Taba Vakili</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nezami</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shetty</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chetty</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srinivasan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Association of high dietary saturated fat intake and uncontrolled diabetes with constipation: evidence from the National Health and Nutrition Examination Survey</article-title>
          .
          <source>Neurogastroenterology &amp; Motility</source>
          <volume>27</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1389</fpage>
          -
          <lpage>1397</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Krishnan</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Gastrointestinal complications of diabetes mellitus</article-title>
          .
          <source>World Journal of Diabetes</source>
          <volume>4</volume>
          (
          <issue>3</issue>
          ), (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Author</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Risk factors for chronic constipation based on a general practice sample</article-title>
          .
          <source>The American Journal of Gastroenterology</source>
          <volume>98</volume>
          (
          <issue>5</issue>
          ),
          <fpage>1107</fpage>
          -
          <lpage>1111</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <source>A Language and Environment for Statistical Computing</source>
          , http://www.R-project.org/. Last accessed 7 May 2018
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <source>Automated Feature Selection from 'caret'</source>
          , https://CRAN.R-project.org/package=fscaret. Last accessed 18 June 2018
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <article-title>Classification and Regression by randomForest</article-title>
          , http://CRAN.R-project.org/doc/Rnews/. Last accessed 30 May 2018
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>