
Rajni Bala, Dharmender Kumar, Classification Using ANN: A Review, International Journal of Computational Intelligence Research

0973-1873

Prediction and Detection of Diabetes using Machine Learning

Olta Llaha

olta.petritaj@fshn.edu.al

Amarildo Rista

amarildorista@gmail.com

South East European University, Arhiepiskop Angelarij, Skopje 1000, North Macedonia

2019


Data mining and machine learning have become a vital part of the detection and prevention of many diseases, diabetes among them. The purpose of this paper is to evaluate data mining methods that can be used to analyze collected data about diabetes, and to compare their performance. We identified the most appropriate data mining methods for the data by comparing them theoretically and practically. Some attributes of the dataset are Age, Body Mass Index, Insulin, Glucose, etc. The methods are applied to these data to determine their effectiveness in analyzing and preventing diabetes. Evaluations on the data showed that the method with the highest performance is the Decision Tree. This was established with several performance measures, namely the number of correctly classified instances, accuracy, precision, recall and F-measure, on which it gave better results than the other methods. We conclude that data mining methods and machine learning contribute to predicting the possibility of occurrence of diabetes.

Keywords: Machine Learning, Prediction, Diabetes Disease, Data Mining

1. Introduction

Diabetes is a disease that increasingly affects the world, including the most developed countries. By the nature of its development as a globally problematic disease, diabetes requires maximum commitment from medical staff, patients, families and society. It is a disease with high social, health and economic costs [1]. Diabetes is a chronic disease characterized by an increase in blood glucose (blood sugar) levels because the body cannot produce insulin, produces too little of it, or cannot use insulin to act on the cells of the organism [2, 3]. Physicians still do not know exactly why this happens, and they have called the cause syndrome X. Historically, diabetes has been treated by fighting the symptoms and not the cause. According to the World Health Organization, diabetes affects about 5% of the world's population and the number of patients is constantly increasing [1]. In developed countries, the largest number of diabetics is found among people over 65 years of age, whereas in developing countries, of which our country is one, the largest number is found in the 45-64 age group; in recent years, however, type 2 diabetes has become more common in the 30-40 age group as well [1]. The availability of historical data naturally leads to the application of data mining techniques for pattern discovery. The goal is to find rules that help to understand diabetes and to diagnose it sooner. Prevention of diabetes is of great interest in the field of medicine. Data mining accelerates data analysis, and analysts can examine existing data to identify patterns and trends in diabetes.

This paper is structured as follows: Section 2 describes the relationship between data mining, machine learning and medicine. The methodology and the dataset are described in Section 3. Sections 4 and 5 give a theoretical description of the methods and algorithms that are applied to our data. Section 6 presents the results of applying the algorithms and discusses the algorithm with the best results. Section 7 presents the conclusions and future work.

2. Using Data Mining and Machine Learning in Medicine

Medicine is the science and practice of establishing the diagnosis, prognosis, treatment, and prevention of disease. Medicine encompasses a variety of health care practices that evolved to maintain and restore health by the prevention and treatment of illness [4]. It is one of the most important areas in which applying data mining techniques can produce significant results [5].

With data mining techniques, doctors will be able to predict illnesses effectively and will be better equipped to manage potential high-risk candidates [6]. The high volume of disease data and the complexity of the relationships within them have made medicine an appropriate field for applying data mining techniques. Data mining can examine many large datasets involving more variables than a single analyst or doctor, or even an analytical team, can handle. Like any other problem-solving method, data mining begins with a problem definition. Identifying the data mining problem determines the data mining process and the modeling technique. Machine learning is a subfield of data science that deals with algorithms able to learn from data and make accurate predictions. Data mining gives health organizations the opportunity to learn about disease trends, among other things. By using data mining methods and machine learning algorithms we can improve the analysis of diabetes and help to reduce and prevent it.

3. Data and Methodology

We compare data mining methods theoretically and practically to discover the most appropriate method for our data. The methods were compared by applying machine learning algorithms to concrete data in the WEKA (Waikato Environment for Knowledge Analysis) environment [7]. The implemented algorithms are Simple Logistic, Multilayer Perceptron, Logistic, Naive Bayes, Bayes Net, SMO and C4.5. Figure 1 shows all the stages of this study of predicting diabetes using data mining methods and the machine learning algorithms of these methods.

In the data gathering step we collected data from the statistics sector of the Public Health Institute. The collected data are stored in a database for further processing. The dataset is made up of 270 records (instances).

The variables (attributes) of this dataset are:

1) Age: The risk of diabetes increases with age, especially over 45 years.

2) Body Mass Index (BMI): An indicator of weight status (underweight, normal, overweight) based on height and weight, given by weight (kg) / (height (m))^2. Ideal BMI values are 18.5-24.9. Values of 25-29.9 indicate that the person is overweight, 30-39.9 indicates obesity and 40+ significant obesity.

3) Insulin: Two-hour serum insulin. Values higher than 150 μU/ml mean that a person needs insulin therapy and is therefore pre-diabetic or diabetic.

4) Glucose: Glucose tolerance test value (glucose in mg/dL two hours after 75 g of glucose). A person is said not to suffer from diabetes if the tolerance test value at two hours is less than or equal to 110 mg/dL (Norman 1).

5) Skin Thickness: Triceps skin fold thickness (mm). The indicative value for women is 23 mm; values higher than normal indicate that the person is overweight.

6) Blood Pressure: Diastolic blood pressure (mm Hg). Normal values are 60-80 mm Hg, 80-89 indicates pre-hypertension and 90+ hypertension.

7) Number of pregnancies: A woman can be diagnosed with gestational diabetes during pregnancy. Hormones produced during pregnancy can make cells more resistant to insulin. Women older than 25 have a higher risk. Moreover, if a woman has diabetes during one pregnancy, there is an increased risk in the next pregnancy (Diabetes-Bing Health).

8) Outcome: Negative when the person is not diagnosed with diabetes and positive when the person is diagnosed with diabetes.

The experiments were conducted with a female population over 19 years of age. The diabetes dataset is in CSV format.
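The BMI formula and the thresholds listed for the Body Mass Index attribute can be sketched in a few lines of Python. This is only an illustration: the function names and the sample person are hypothetical, and values between the listed ranges (for example 24.95) are assigned to the lower category as an assumption.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by squared height in metres."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to the categories listed in the attribute description."""
    if value < 18.5:
        return "underweight"
    if value < 25.0:
        return "normal"
    if value < 30.0:
        return "overweight"
    if value < 40.0:
        return "obesity"
    return "significant obesity"


if __name__ == "__main__":
    # Hypothetical person, not a record from the dataset: 70 kg, 1.65 m.
    value = bmi(70.0, 1.65)
    print(round(value, 1), bmi_category(value))
```

Such a helper could be used to derive a categorical BMI attribute from raw height and weight before classification, if the source data stored them separately.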

4. Classification

Classification is a data mining technique that categorizes data in order to assist in more accurate predictions and analysis [8]. It is one of the data mining methods that aims to analyze very large datasets. It is used to derive patterns that accurately define the important data classes within the dataset. Classification techniques predict the target class for each data instance [9]. Classification algorithms attempt to detect relationships between attributes that make it possible to predict the result. They analyze the input and produce a prediction. The classification task of data mining is widely used in the healthcare industry [9].

4.1. Naïve Bayes

Bayesian classification is both a supervised learning method and a statistical classification method. The Naive Bayes classifier is based on Bayes' theorem and is used especially when the dimensionality of the inputs is high [10].

Bayesian classification provides practical learning algorithms in which prior knowledge and observed data can be combined. It calculates explicit probabilities for hypotheses. The algorithm works as follows. Bayes' theorem offers a way to calculate the probability of a hypothesis based on our prior knowledge, and it works on the basis of conditional probability [11]. It can be represented as:

$$P(M \mid N) = \frac{P(N \mid M)\, P(M)}{P(N)}$$

Here M and N are two events: P(M|N) is the conditional probability of M given N, P(M) is the probability of M, P(N) is the probability of N, and P(N|M) is the conditional probability of N given M.
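As a small worked illustration of Bayes' theorem, P(M|N) = P(N|M) P(M) / P(N), the following Python sketch computes a posterior probability. The probabilities are invented for the example (M standing for "has diabetes", N for "glucose test above threshold") and are not taken from the dataset.

```python
def posterior(p_n_given_m: float, p_m: float, p_n: float) -> float:
    """Bayes' theorem: P(M|N) = P(N|M) * P(M) / P(N)."""
    if p_n <= 0:
        raise ValueError("P(N) must be positive")
    return p_n_given_m * p_m / p_n


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only for illustration.
    p_m = 0.25              # prior P(M)
    p_n_given_m = 0.8       # likelihood P(N|M)
    p_n_given_not_m = 0.2   # P(N | not M)
    # Law of total probability gives the evidence term P(N).
    p_n = p_n_given_m * p_m + p_n_given_not_m * (1 - p_m)
    print(posterior(p_n_given_m, p_m, p_n))
```

A Naive Bayes classifier applies this rule with one likelihood term per attribute, assuming the attributes are conditionally independent given the class.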

Naive Bayes is a strong predictor. The technique can be useful for very large datasets [12]. The Naive Bayesian classifier is fast and incremental, and it can deal with both discrete and continuous attributes. It performs well and can explain its decisions.

4.2. Support Vector Machine

The SVM classifier is a supervised learning algorithm based on statistical learning theory, introduced by Vapnik (Vapnik, 1995) [13]. The main idea behind this method is to determine a hyperplane that optimally separates two classes using a training dataset. SVM is a set of related supervised learning methods used in medical diagnosis for classification and regression [14]. A Support Vector Machine (SVM) model represents examples as points in space, mapped so that the examples of the different categories are divided by a clear gap that is as large as possible [15]. SVM supports both regression and classification and can handle multiple continuous and categorical variables. The efficiency of SVM-based classification does not depend directly on the dimension of the classified entities. The algorithm achieves high discriminative power by using special nonlinear functions called kernels to transform the input space into a multidimensional space [16]. The choice of kernel function and of the best parameter values for a particular kernel is critical for a given amount of data [16]. The SMO implementation in WEKA also normalizes all attributes by default.
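The two ideas in this paragraph, a separating hyperplane and a kernel function, can be sketched as follows. This is not a training algorithm and not the SMO implementation: the weights passed to the decision function are assumed to be already learned, and the Gaussian (RBF) kernel is shown only as one common kernel choice that the text does not name.

```python
import math
from typing import Sequence


def linear_decision(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Classify x by the side of the hyperplane w.x + b = 0 on which it falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float = 0.5) -> float:
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


if __name__ == "__main__":
    # Hypothetical hyperplane and points, for illustration only.
    print(linear_decision([1.0, -1.0], 0.0, [2.0, 1.0]))
    print(rbf_kernel([1.0, 2.0], [1.5, 2.5]))
```

The kernel value plays the role of a similarity between two points; the parameter gamma is an example of the kernel parameters whose choice the paragraph describes as critical.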

4.3. Decision Tree

A decision tree model has a tree structure that describes the process of classifying instances based on their features [17]. It splits the data into subsets based on the values of one or more fields. This process is repeated recursively for each subset until all instances at a node belong to a single class. The result is a tree-shaped structure that describes the series of decisions taken at each step [17]. Decision trees are easy to interpret and understand. They provide a white-box structure for any given dataset and can be combined with other data mining techniques [18]. Typical decision tree algorithms are ID3, C4.5, CART and so on. In this study, we used the C4.5 algorithm. C4.5 uses the gain ratio, that is, the ratio between the information gain and its split information. It selects the attribute that most effectively separates the data into subsets enriched in one class.
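The splitting criterion described above, information gain divided by split information, can be sketched for a single categorical attribute. This is a minimal illustration under simplifying assumptions: it ignores the thresholding of continuous attributes and the handling of missing values that a full C4.5 implementation performs, and the attribute values in the example are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the value distribution."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())


def gain_ratio(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain of splitting on `feature`, normalized by its split information."""
    total = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature)  # entropy of the partition itself
    return info_gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical toy data: a binned glucose attribute and the outcome.
    glucose = ["high", "high", "low", "low"]
    outcome = ["positive", "positive", "negative", "negative"]
    print(gain_ratio(glucose, outcome))
```

At each node, the attribute with the highest gain ratio would be chosen for the split.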

The tree is generated using the normalized information gain [19]. The C4.5 inductive algorithm generates rules from a single tree. It can also transform multiple decision trees and create a set of classification rules. Such features of this algorithm can be used to scale general rules, instruction time, size, and number of rules. This algorithm suits medical records because it copes with missing values. Furthermore, it handles continuous data, which are common in medical symptoms.

Random Forest is a classification method that combines hundreds or thousands of decision trees. It trains each of them on a slightly different set of the observations and splits the nodes in each tree considering a limited number of the features [20]. The final predictions of the Random Forest method are made by aggregating the predictions of the individual trees (averaging, or majority voting for class labels). It is fast and easy to implement, it produces highly accurate predictions, and it can handle a very large number of input variables without over-fitting [21].

4.4. Artificial Neural Network

Neural networks are an area of Artificial Intelligence (AI) inspired by the human brain [22]. Applying neural network techniques, a program can learn from examples and create an internal set of rules for classifying different inputs. All processing in a neural network is performed by a group of neurons, or units [22]. Each neuron is a separate communication device with a relatively simple operation: a unit receives data from other units, calculates an output value as a function of the inputs it receives, and sends that value to other units. In artificial neural networks, neurons are organized in layers which process information using dynamic state responses to external inputs [17]. Artificial neural network classification is an example of supervised learning [23]. Artificial neural networks (ANNs) are capable of predicting new observations from existing observations. The neural network method is used for clustering, feature mining, prediction and pattern recognition. One of the most used neural networks is the Multilayer Perceptron (MLP), whose neurons apply a nonlinear activation function to calculate their outputs [24]. The activation functions are a sigmoid function, $f(x) = 1 / (1 + e^{-x})$, in the hidden layer and a linear function, $f_j(x) = \sum_{i=1}^{p} w_{ij} x_i$, in the output layer, where the $x_i$ are predictor variables and the $w_{ij}$ are input weights. The functional form of the MLP can be written as:

$$y_j = f\left(\sum_{i=1}^{N} w_{ji} x_i + b_j\right)$$

where $x_i$ is the i-th nodal value in the previous layer, $y_j$ is the j-th nodal value in the present layer, $b_j$ is the bias of the j-th node in the present layer, $w_{ji}$ is a weight connecting $x_i$ and $y_j$, N is the number of nodes in the previous layer, and f is the activation function in the present layer [24].
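The layer equation y_j = f(sum_i w_ji x_i + b_j) can be illustrated with a short forward pass for one layer. The weights, biases and inputs below are hypothetical placeholders, and no training (for example backpropagation) is shown.

```python
import math
from typing import Callable, List, Sequence


def sigmoid(z: float) -> float:
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(
    x: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    activation: Callable[[float], float] = sigmoid,
) -> List[float]:
    """Compute y_j = f(sum_i w_ji * x_i + b_j) for every node j of the layer.

    weights[j][i] connects input node i to node j of the present layer.
    """
    return [
        activation(sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j)
        for row, b_j in zip(weights, biases)
    ]


if __name__ == "__main__":
    # Hypothetical two-input, two-node hidden layer.
    hidden = layer_forward([1.0, 2.0], [[0.5, -0.25], [0.1, 0.3]], [0.0, -0.2])
    # Linear output layer: identity activation over the hidden values.
    output = layer_forward(hidden, [[1.0, -1.0]], [0.0], activation=lambda z: z)
    print(hidden, output)
```

Stacking a sigmoid hidden layer and a linear output layer in this way mirrors the MLP structure described in the text.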

5. Association Rules and Regression

Association rule mining is one of the most important canonical tasks in data mining and probably one of the most studied techniques for pattern discovery. Association rules are if/then statements that help to uncover relationships between seemingly unrelated data in a database, relational database or other information repository [25]. Association rules identify the arguments found together in a given event or record: "the presence of one set of arguments brings the presence of another set". This is how rules of the following type are identified: "if argument A is part of an event, then with a certain probability argument B is also part of the event" [26]. Association also has great impact in the health care industry, where it is used to discover the relationships between diseases, the state of human health and the symptoms of disease [27]. Association rules can be used to detect and study the etiological pathways in populations, as they suggest interconnections of various risk factors responsible for a disease and are easily interpretable [26].

The objective of association rule mining is to discover interesting association or correlation relationships among a large set of data items. Support and confidence are the best-known measures for evaluating an association rule.
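Support and confidence can be computed directly from a list of records. The sketch below uses the usual definitions: the support of an item set is the fraction of records containing it, and the confidence of a rule A -> B is support(A and B) / support(A). The records and item names are hypothetical and serve only as an illustration.

```python
from typing import Hashable, Iterable, Sequence, Set


def support(transactions: Sequence[Set[Hashable]], itemset: Iterable[Hashable]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)


def confidence(
    transactions: Sequence[Set[Hashable]],
    antecedent: Iterable[Hashable],
    consequent: Iterable[Hashable],
) -> float:
    """Confidence of the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base if base > 0 else 0.0


# Hypothetical records, not taken from the study's dataset.
records = [
    {"high_glucose", "high_bmi", "diabetes"},
    {"high_glucose", "diabetes"},
    {"high_bmi"},
    {"high_glucose", "high_bmi"},
]

if __name__ == "__main__":
    print(support(records, {"high_glucose"}))
    print(confidence(records, {"high_glucose"}, {"diabetes"}))
```

A rule is usually kept only if both measures exceed thresholds chosen by the analyst.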

While classification provides categorical, discrete labels, regression models continuous function values. Regression is therefore used mainly to predict missing numeric data values rather than discrete class labels. Regression analysis is a statistical technique for examining the connection between a dependent variable and one or more independent variables, with the aim of predicting the dependent variable from the independent variable or variables [28]. Regression also involves identifying the distribution of trends based on available data. For this purpose regression trees can be used, as well as decision trees whose nodes have numerical values instead of categorical values.

Logistic regression is used to estimate the probability of occurrence of a specific event; the dependent variable is expressed through the odds, which is another way of expressing probability. The model can be regarded as a generalized linear model with the logit as its link function and an error that follows a binomial distribution [28].

The model is written as:

$$\operatorname{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \alpha + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}, \qquad i = 1, \ldots, n$$

that is,

$$p_i = \Pr(Y_i = 1 \mid X) = \frac{e^{\alpha + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}{1 + e^{\alpha + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}}}$$

where p is the probability that an example belongs to a particular category, e is the base of the natural logarithm (~2.72), α is the constant of the equation, and the β are the coefficients of the predictor variables.

6. Experimental Results

To conduct this study we used the WEKA software [7], chosen for its approach and our familiarity with its use. WEKA is an open source tool for data mining which allows users to apply preprocessing algorithms, but it does not provide assistance in choosing which one to apply. However, since different algorithms have different requirements regarding the dataset, some preprocessing is applied by default inside some of the algorithms. Data preprocessing includes cleaning, instance selection, normalization, transformation, feature extraction, selection, etc. Data preprocessing affects the way in which the outcomes of the final data processing can be interpreted.

The WEKA software package has different programs for different techniques and algorithms.

Experiments were done using cross-validation with the default option folds = 10. Cross-validation helps to obtain more reliable model results, and the 10-fold technique was used for this reason. We divided our dataset into 10 samples. Each sample in turn was retained as validation data, while the remaining 9 samples acted as training data. The process was repeated 10 times, which is why it is called 10-fold cross-validation.

The advantage of this procedure is that it cuts down the bias associated with random sampling methods.
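The 10-fold procedure described above can be sketched as an index generator. This is a plain, non-stratified split written for illustration; it is not WEKA's internal procedure, and the function name and the random seed are assumptions.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 10, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every k-th index goes to the same fold, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


if __name__ == "__main__":
    # 270 matches the number of records reported for the dataset.
    for train, validation in k_fold_indices(270, k=10):
        print(len(train), len(validation))
```

Each record appears in the validation set exactly once, and the performance measures are then averaged over the 10 folds.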

Different classification algorithms were applied to our dataset, and the results of the methods differed slightly because each algorithm works on different criteria. The results were evaluated on the basis of correctly classified instances, accuracy, precision, recall and F-measure. The performance indicators are given in Table 2 and Table 3.

The C4.5 decision tree algorithm is clear, and its results are easy to interpret. It selects the attribute that most effectively separates the data into subsets enriched in one class. The model is constructed by modifying the parameter values, and this algorithm classifies the diabetes data with a higher accuracy than the other data mining algorithms. This is shown in Table 3, which compares the accuracy of the models after the algorithms were applied.

The accuracy of a classifier refers to its ability to predict the class label correctly, and the accuracy of a predictor refers to how well it can guess the value of the predicted attribute for new data. The F-measure is a measure of a test's accuracy that considers both the precision and the recall of the test: precision is the number of correct positive results divided by the number of all positive results returned by the classifier, and recall is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).

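$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

True positive (TP): correct positive prediction
False positive (FP): incorrect positive prediction
True negative (TN): correct negative prediction
False negative (FN): incorrect negative prediction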
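The four measures can be computed from the confusion-matrix counts as in the following sketch. The counts in the example are hypothetical and are not the results reported in the tables.

```python
def precision(tp: int, fp: int) -> float:
    """Correct positive predictions divided by all positive predictions."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Correct positive predictions divided by all actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


if __name__ == "__main__":
    # Hypothetical confusion-matrix counts, for illustration only.
    tp, fp, tn, fn = 40, 10, 40, 10
    print(precision(tp, fp), recall(tp, fn), f_measure(tp, fp, fn), accuracy(tp, tn, fp, fn))
```

The guards against empty denominators return 0.0 by convention when a classifier makes no positive predictions or the data contain no positives.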

We converted our data to CSV format. The C4.5 algorithm for building decision trees is implemented in WEKA as a classifier called J48, whose full name is weka.classifiers.trees.J48. The output of this algorithm, the visualization and the decision tree, is presented in Figure 4 and Figure 5.

The implementation of this algorithm classified the diabetes data based on the dataset attributes, with precision, recall and F-measure values higher than those of the other data mining algorithms. This is shown in Figure 3. Figure 5 shows the decision tree generated by the C4.5 algorithm.

7. Conclusion

The purpose of this article was to create a decision-making structure for diagnosing diabetes. This structure was realized by studying classification data mining methods such as Naive Bayes, Decision Tree, Support Vector Machine (SVM) and Logistic Regression, and by evaluating them to find the highest performing method on the dataset. The results of the experiments conducted in this research have revealed that these methods are applicable to diabetes prediction. The decision tree, as a data mining classification method, classified the diabetes data with an accuracy rate of 79%. This method has shown promising results for the problem of diabetes prediction, as its accuracy rate is high in the experiments performed. Furthermore, the decision tree seems more viable because, in contrast to other algorithms, it expresses the rules explicitly. These rules can be expressed in human language so that anyone can understand them. Decision trees are easy to interpret and understand. The use of machine learning in the analysis of diabetes is important because data mining methods and machine learning can be used in the decision-making process. In a future extension of this study, models will be created for predicting diabetes that will help health centers, hospitals, etc. to create policies or make decisions about preventing diabetes. Changes in the behavior of the algorithms will be examined when more data are added. We also plan to repeat the study not only on women but on all persons regardless of gender. Finally, we intend to implement this study in an integrated Diabetes Decision Support System (DDSS) that we will create.

8. References

[1] http://www.ishp.gov.al/wpcontent/uploads/2015/kalendar/Dita%20boterore%20e%20diabetit.pdf

[2] https://www.familjadheshendeti.com/semundja-e-sheqerit-diabeti-te-femrat/

[3] S. He, Kuang-i Shu and Heng Zhang, Machine Learning and Data Mining in Diabetes Diagnosis and Treatment, IOP Conference Series: Materials Science and Engineering 490 (2019) 042049, doi: 10.1088/1757-899X/490/4/042049

[4] https://en.wikipedia.org/wiki/Medicine

[5] Ionita, Irina & Ioniță, Liviu. (2016). Applying Data Mining Techniques in Healthcare. Studies in Informatics and Control. 25. 385-394. 10.24846/v25i3y201612.

[6] Bisandu, Desmond & Datiri, Dorcas & Onokpasa, Eva & Thomas, Godwin & Haruna, Musa & Aliyu, Aminu. (2019). Diabetes Prediction using Data mining Techniques. International Journal of Innovation Science. 4. 103-111.

[7] Frank, Eibe & Hall, Mark & Holmes, Geoffrey & Kirkby, Richard & Pfahringer, Bernhard & Witten, Ian & Trigg, Len. (2010). Weka-A Machine Learning Workbench for Data Mining. 10.1007/978-0-387-09823-4_66.

[8] Pang-Ning Tan; Michael Steinbach; Anuj Karpatne; Vipin Kumar, Introduction to Data Mining, 2nd ed., Publisher: Pearson, 2019, Print ISBN: 9780133128901, 0133128903, eText ISBN: 9780134080284, 013408028.

[9] Pandey, Dr. Subhash. (2016). Data Mining Techniques for Medical Data: A Review. 10.1109/SCOPES.2016.7955586.

[10] Sisodia, Deepti & Sisodia, Dilip. (2018). Prediction of Diabetes using Classification Algorithms. Procedia Computer Science. 132. 1578-1585. 10.1016/j.procs.2018.05.122.

[11] Tharak Roopesh, Asadi Srinivasulu and K. S. Kannan, Prediction of Diabetes Disease Using Data Mining and Deep Learning Techniques, EasyChair Preprint № 1608, October, 2019.

[12] Priyadarshini, I. Lakshmi, Predictive Analysis of Diabetes Using Bayesian Network and Naive Bayes Techniques, International Conference on Advancements in Computing Technologies - ICACT 2018, Volume: 4 Issue: 2, ISSN: 2454-4248.

[13] Giveki, D., Salimi, H., Bahmanyar, G., & Khademian, Y. (2012). Automatic Detection of Diabetes Diagnosis using Feature Weighted Support Vector Machines based on Mutual Information and Modified Cuckoo Search. ArXiv, abs/1201.2173.

[14] Jegan, Chitra. (2013). Classification Of Diabetes Disease Using Support Vector Machine. International Journal of Engineering Research and Applications. 3. 1797-1801.

[15] Sahana Shetty, Kaveri Kari and Jayantkumar A. Rathod, Detection of Diabetic Retinopathy Using Support Vector Machine (SVM), International Journal of Emerging Technology in Computer Science & Electronics (IJETCSE), ISSN: 0976-1353, Volume 23, Issue 6, October 2016 (Special Issue).

[16] Wei, Tiebin Liu, Rodolfo Valdez, Marta Gwinn, Muin J Khoury, Application Methods in Prediction of Diabetes in Iran. Healthcare informatics research. 19. 177-85. 10.4258/hir.2013.19.3.177.

[25] Kumbhare, Trupti A. and Santosh Chobe. "An Overview of Association Rule Mining Algorithms." (2014).

[26] Umang Soni, Sushma Behara, Karthik Unni Krishnan, Ramniwas Kumar, Application of Association Rule Mining in Risk Analysis for Diabetes Mellitus, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5, Issue, April 2016, ISSN (Online) 2278-1021, ISSN (Print) 2319 5940.

[27] Patel, Sheenal & Patel, Hardik. (2016). Survey of Data Mining Techniques used in Healthcare Domain. International Journal of Information Sciences and Techniques. 6. 53-60. 10.5121/ijist.2016.6206.

[28] Parastoo Rahimloo, Ahmad Jafarian, Prediction of Diabetes by Using Artificial Neural Network, Logistic Regression Statistical Model and Combination of Them, Bulletin de la Société Royale des Sciences de Liège, Vol. 85, 2016, p. 1148-1164.