An Explainable Model for Diabetes Risk Prediction

Alessandro Cabroni¹, Francesca Fallucchi¹

¹ Guglielmo Marconi University, Via Plinio 44, 00193 Rome, Italy

Abstract
In Artificial Intelligence, one of the most important issues concerns the necessity to understand why a particular prediction is chosen by a model from the considered input data. In this work, we propose a model, named Global Prediction Architecture, based on three layers (Multilayer Perceptron, Closest Classes and Elements, and a third layer that combines them), where the first layer produces both a partial prediction and the extracted features used by the second layer. We are interested in analyzing the behavior of the model both for accuracy and for explainability in terms of input data. We apply our study in the healthcare context of diabetes. Diabetes (diabetes mellitus) is a disease present when a person has a high blood sugar level for a long period. One important issue is the possibility of preventing the disease. We analyze the possibility of determining the diabetes risk with respect to daily lifestyle and health parameters, such as Body Mass Index, age, waist circumference, use of blood pressure medication, history of high blood glucose, physical activity, consumption of vegetables/fruits/berries, and family history of diabetes. We produce datasets randomly generated according to the rule named Finnish Diabetes Risk Score. This work aims to produce random and anonymized diabetes risk datasets, to test a model in terms of improving accuracy for the prediction of diabetes risk, and, most of all, to propose and test a method for explainability in the context of diabetes prediction, using an approach initially derived from Layer-Wise Relevance Propagation and Deep Taylor Decomposition.

Keywords
Diabetes risk prediction, FINnish Diabetes RIsk SCore, Multilayer Perceptron, Explainability, Layer-Wise Relevance Propagation, Deep Taylor Decomposition

1. Introduction

In healthcare, one of the topics of interest is disease prevention. In this work, we consider the problem of identifying the risk of type 2 diabetes for a person. We are interested in three principal issues: the production of testing datasets, the definition of a model to improve prediction accuracy, and the definition of an explainability method adequate to the prediction model. About the first issue, we use datasets randomly generated according to the rule FINnish Diabetes RIsk SCore (FINDRISC) [1]. Using random datasets, we have the possibility to establish controlled data useful to compare different models, without any privacy problem. About the second issue, we consider a new model based on three layers, first of all a Multilayer Perceptron (MLP).
This layer produces both a prediction and the extracted features. The extracted features are used by a second layer based on comparing one unlabeled node (the testing node) with all labelled nodes (the training nodes) in terms of similarities, considering class (diabetes risk level) similarity too. The third layer puts together the predictions of the first two layers in a weighted manner. We name the overall model Global Prediction Architecture (GPA). We obtain an accuracy better than using some algorithms of the Waikato Environment for Knowledge Analysis (WEKA) tool [2], and an accuracy slightly better than using only the MLP component. We establish the diabetes risk according to daily lifestyle and health parameters, such as Body Mass Index (BMI), age, waist circumference, use of blood pressure medication, history of high blood glucose, physical activity, consumption of vegetables/fruits/berries, and family history of diabetes. There are other works about this issue for diabetes (e.g., [3]). About the third issue, we propose an explainability solution based on reasoning about the relevance of the input data with respect to the prediction. In particular, we combine a new solution conceptually derived from Layer-Wise Relevance Propagation (LRP) and Deep Taylor Decomposition (DTD) (e.g., [5], [7]) with the distribution of the extracted features over the training data, to capture the relevance of the features in the second layer. Hence, from the explainability point of view, we have a theoretical model for the first (and implicitly the third) layer, and a model based on the data distribution (we consider the standard deviation of each single feature with respect to the training data) for the second (and implicitly the third) layer. We could add our solution to other studies in a similar context (e.g., [8]). We could also explore the use of the solution in an Internet of Things (IoT) context, also considering the possibilities of 5G networks (about this last subject, see e.g. [9], [11]), and for different architectures in other domains (e.g., see [12], [13, 14]).

The following sections are organized as follows. The Related work section reports some works about prediction on diabetes and a summary of the major concepts behind LRP and DTD. The Methodology section reports the steps followed in the research for this work. The Tools and environments section reports the principal tools and environments used to implement and test our solution. The Dataset definition section outlines the rule used to implement the randomly generated datasets, with a visual distribution in terms of mean and standard deviation over all input attributes. The Dataset analysis section reports the results of an analysis conducted on the largest training dataset; it presents the results of a prediction test using some algorithms available in the WEKA tool. The Prediction model section describes the model defined and tested in this work in the context of diabetes risk prediction; in this section, we also present the accuracy definitions used, the values of the hyper-parameters which instantiate the model, and the prediction results. The Explainability model section presents our solution to explain the behavior of the prediction model in terms of input data; this section also reports the hyper-parameters used in the tests finalized to explainability and the results in terms of input data relevance for the prediction. In the Conclusion section, we briefly summarize the obtained results about dataset creation, prediction accuracy, and explainability.

2. Related work

In this section, we first briefly cite three chosen works about prediction in the context of diabetes. Then, we present some concepts about explainability, in particular for LRP and DTD.

In [15], they use the Pima Indian Diabetes (PID) dataset and test seven Machine Learning (ML) algorithms for predictions related to diabetes, also using the WEKA tool. They obtain the best results by using Logistic Regression (LR) and Support Vector Machine (SVM) for diabetes prediction. They also implemented a Neural Network (NN) with two hidden layers for the accuracy. In [16], they evaluate the risk of diabetes based on lifestyle and family background. They consider 952 instances produced by a questionnaire related to health, lifestyle and family background. They applied different ML algorithms both to this dataset and to the PID dataset. The most accurate performance is obtained by the Random Forest (RF) classifier. Also in [17], they trained the ML models using the PID dataset. They propose a framework based on pre-processing, K-fold Cross-Validation (KCV), and grid search for hyper-parameters, to select the best model among different algorithms. In future work they are interested in applying their results in other medical contexts to verify the general usefulness.

In the general context of explainability, as a basis for our study we are interested in LRP and DTD. In this section, we review some of the concepts described in [5], [7], [18], [19], and [20]. In LRP, the prediction is propagated backwards through the NN. Each propagation step redistributes the relevance to the lower level of the NN by a conservation rule:

$$R_j = \sum_k \frac{z_{jk}}{\sum_{j'} z_{j'k}} R_k \qquad (1)$$

where $z_{jk}$ corresponds to how much neuron j contributes to the relevance of neuron k. The recursive propagation finishes at the input data. One single step can be defined as a Taylor decomposition. In our context, we consider an MLP as an acyclic graph based on the Rectified Linear Unit (ReLU) activation function at each layer, with input data not less than zero. Supposing we have a neuron N receiving the input vector $x_{input} = (x_1, \ldots, x_n)$ and producing the scalar $y_{output}$, we have:

$$y_{output} = \max\left(0, \sum_{i=1}^{n} x_i w_{ij} + b_j\right) \qquad (2)$$

with $b_j \le 0$. Considering DTD, we have that LRP corresponds to a succession of Taylor expansions local to each neuron. We now consider that the output can be described as a first-order Taylor expansion. Defining $[y_{output}]_i$ as the redistribution of $y_{output}$ on neuron i of the lower layer, we have the rule of redistribution ($z^+$-rule) when the lower level of N is a ReLU layer:

$$[y_{output}]_i = \frac{x_i w_{ij}^+}{\sum_{k=1}^{n} x_k w_{kj}^+}\, y_{output} \qquad (3)$$

where n equals the number of neurons in the lower level of N, and $v^+ = |v|$. Defining $x_f$ as the final output of the NN for a particular input data, we have that $[[x_f]_j]_i$ corresponds to the quantity of $x_f$ distributed from one node j to one node i, where i is an input node for node j, and $[x_f]_i$ corresponds to the quantity of $x_f$ distributed on node i:

$$[x_f]_i = \sum_{j=1}^{n_1} [[x_f]_j]_i \qquad (4)$$

where $n_1$ equals the number of nodes in the higher level for node i. Considering $[x_f]_j = x_j c_j$ (neuron activation times a constant value), we have:

$$[x_f]_i = \sum_{j=1}^{n_1} [[x_f]_j]_i = \sum_{j=1}^{n_1} [x_j c_j]_i = \sum_{j=1}^{n_1} [x_j]_i\, c_j = x_i c_i \qquad (5)$$

Moreover,

$$c_i = \sum_{j=1}^{n_1} \frac{w_{ij}^+\, [x_f]_j}{\sum_{i_1=1}^{n} x_{i_1} w_{i_1 j}^+} \qquad (6)$$

where $n_1$ equals the number of nodes in the higher level of node i, and n is the number of neurons in the lower level of j (the same level as i). At the beginning, we have $[x_f]_f = x_f c_f$ and $c_f = 1$. By induction, there is a product structure with a backward propagation rule and the conservation of the output (redistribution on the input nodes). A small numerical sketch of this propagation is given below.
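To make the backward redistribution concrete, the following is a minimal NumPy sketch (our illustration, not code from [5] or [7]) that applies the $z^+$-rule of eq. (3) layer by layer in a small ReLU network; the weight matrices Ws and the input x are toy values, and, following the paper's convention $v^+ = |v|$, absolute weights are used.

import numpy as np

def zplus_relevance(x, Ws):
    """Backward relevance propagation with the z+-rule (eq. 3).
    x  : non-negative input vector.
    Ws : list of weight matrices; Ws[l] has shape (units_l, units_l+1).
    """
    # Forward pass, storing the ReLU activations of every layer.
    activations = [np.asarray(x, dtype=float)]
    for W in Ws:
        activations.append(np.maximum(0.0, activations[-1] @ W))

    # All relevance initially sits on the final-layer activations.
    R = activations[-1].copy()

    # Backward pass: redistribute relevance proportionally to x_i * w_ij^+.
    for W, a in zip(reversed(Ws), reversed(activations[:-1])):
        Wp = np.abs(W)                 # v+ = |v|, as in the paper
        z = a @ Wp                     # denominators of eq. (3)
        z[z == 0] = 1e-12              # avoid division by zero
        R = a * (Wp @ (R / z))         # eq. (3), summed over upper neurons
    return R

# Toy usage: two-layer ReLU net with random non-negative inputs.
rng = np.random.default_rng(0)
x = rng.random(4)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
R = zplus_relevance(x, Ws)
print(R, R.sum())  # the sum equals the total relevance at the output

Because each step only splits every $R_k$ proportionally among the lower neurons, the total relevance is conserved down to the inputs, which is the property the paper's forward explainability rule also maintains.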
3. Methodology

In this research, we first analyzed some papers about diabetes risk prediction and about explainability via LRP and DTD. We selected a rule (FINDRISC) to produce random datasets. We defined our prediction model, named GPA, with three layers: MLP for partial prediction and features extraction, Closest Classes and Elements (CCE) for partial prediction, and Weighted Sum (WS) for the final prediction. CCE evaluates the similarity between one unlabeled node and all labelled nodes, using the extracted features. WS sums the two partial predictions, adequately weighted, to define the final prediction by argmax. We analyzed the best hyper-parameters, also using GridSearchCV of the scikit-learn tool. We analyzed the test predictions, comparing GPA accuracies against MLP accuracies and against the accuracies of some WEKA algorithms. We defined the explainability solution based on a forward DTD-derived component for the first (and implicitly the third) layer, and on the weight (standard deviation) of the extracted features (computed on the training data) for the second (and implicitly the third) layer. We tested the explainability using a simplified MLP.

4. Tools and environments

Table 1 reports the used environments and tools, distinguishing between the first set (datasets definition, prediction, explainability) and the second set (only dataset analysis).

Table 1
Used environments and tools.

Set: Datasets definition, Prediction, Explainability
Tool and environment: Colaboratory (Colab), backend Google Compute Engine, Python™ 3; RAM: 0.75GB out of 12.69GB; available disk space: 38.47GB out of 107.72GB; Tensor Processing Unit (TPU)

Set: Dataset analysis
Tool and environment: Weka 3.8.5, Windows 8.1 (64 bit), Intel® Celeron® CPU 1007U 1.50GHz, RAM 4GB (3.88GB usable)
5. Dataset definition

We generate eight datasets according to FINDRISC [1]. Four datasets are for the prediction experiments (2500 elements for testing; 1000, 1500, and 2000 for training), and the other four are for the explainability experiments (1750 elements for testing; 1000, 1250, and 1500 for training). The rule identifies at-risk individuals without laboratory tests. It considers five risk levels with respect to the score: very low (0-3), low (4-8), moderate (9-12), high (13-20) and very high (21-26). All datasets are equally balanced with respect to the possible scores. These are the attributes to be considered: BMI (weight (kg) / height squared (m²)), age (years), waist circumference (differentiating by gender), use of blood pressure medication, history of high blood glucose, physical activity expressed in hours/week, daily consumption of vegetables, fruits or berries, and family history of diabetes. The score is calculated according to the rule. The random input data are normalized to [0, 1]. These are the input data of our prediction model, while the risk score is the right prediction. In Figure 1, we can see the distribution (mean and standard deviation) of the generated datasets; a sketch of how such a dataset can be generated is shown below.

Figure 1: Dataset distribution for input data (from left to right: prediction and explainability experiments).
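As an illustration of this kind of generator, the following is a minimal sketch (our reconstruction, not the authors' code). The point assignments follow the published FINDRISC rule [1]; the attribute sampling ranges, the boolean encoding of physical activity, and the helper names are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(42)

def findrisc_score(age, bmi, waist, male, activity_ok, veggies_daily,
                   bp_meds, high_glucose, family_history):
    """FINDRISC points per the published rule [1]; family_history is
    0 = none, 1 = second-degree relatives, 2 = first-degree relatives."""
    score = 0
    score += 0 if age < 45 else 2 if age < 55 else 3 if age < 65 else 4
    score += 0 if bmi < 25 else 1 if bmi <= 30 else 3
    limits = (94, 102) if male else (80, 88)     # waist bands by gender (cm)
    score += 0 if waist < limits[0] else 3 if waist <= limits[1] else 4
    score += 0 if activity_ok else 2             # >= 30 min/day of activity
    score += 0 if veggies_daily else 1           # daily vegetables/fruits/berries
    score += 2 if bp_meds else 0                 # blood pressure medication
    score += 5 if high_glucose else 0            # history of high blood glucose
    score += (0, 3, 5)[family_history]
    return score                                 # in [0, 26]

def risk_level(score):
    """Five levels used in this work: 0-3, 4-8, 9-12, 13-20, 21-26."""
    return np.digitize(score, [4, 9, 13, 21])    # 0 = very low ... 4 = very high

def random_instance():
    return dict(age=rng.uniform(18, 90), bmi=rng.uniform(16, 45),
                waist=rng.uniform(60, 140), male=rng.random() < 0.5,
                activity_ok=rng.random() < 0.5, veggies_daily=rng.random() < 0.5,
                bp_meds=rng.random() < 0.5, high_glucose=rng.random() < 0.5,
                family_history=int(rng.integers(0, 3)))

sample = random_instance()
print(sample, findrisc_score(**sample), risk_level(findrisc_score(**sample)))

The paper additionally normalizes each attribute to [0, 1] before feeding it to the model and balances the datasets over the five risk levels; both steps are omitted here for brevity.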
6. Dataset analysis

For a preliminary analysis of the produced datasets, we considered the dataset with 2000 elements and we analyzed in detail both its data distribution and the accuracy results, using the WEKA tool [2]. In Table 2 we can see the accuracy results for the considered algorithms: J48, KStar, MLP, Naïve Bayes (NB), and RandomTree (RT). We used 10-fold cross-validation for the analysis.

Table 2
WEKA prediction tests using the dataset with 2000 nodes (10-fold cross-validation).

Model    Accuracy
J48      0.826
KStar    0.71
MLP      0.7965
NB       0.713
RT       0.7595

7. Prediction model

Our prediction model has three layers: the first layer is the MLP, the second layer is CCE (it uses the features extracted from the MLP), and the third layer combines the predictions of both MLP and CCE. The MLP has the following elements: dense, batch normalization, ReLU activation, dropout; dense, batch normalization, ReLU activation, dropout; dense, batch normalization, ReLU activation, dropout; dense, Softmax activation. CCE uses the features extracted from the third dense layer; for each testing node we implement the following algorithm:

• Calculate the Euclidean distance between the considered testing node and all training nodes.
• Normalize these distances to [0, 1].
• Using the normalized distances, calculate the Gaussian kernel similarity between the considered testing node and all training nodes:

$$sim_{i,j} = e^{-\frac{d(i,j)^2}{2\sigma^2}} \qquad (9)$$

• Normalize these similarities to [0, 1].
• For each possible label (risk class), calculate the sum of the similarities.
• Normalize all the sums to the overall sum.
• Recalculate the sum distribution, considering the similarity between labels too, according to the following algorithm (m is the number of labels/risk classes and S is the vector of normalized per-class sums):

import numpy as np

STemp = np.copy(S)
# Each class also receives a contribution from its neighbouring
# risk levels, because the risk levels are ordered.
S[0] = STemp[0] + STemp[1] * (m - 1) / m
for h in range(1, m - 1):
    S[h] = STemp[h] + (STemp[h - 1] + STemp[h + 1]) / 2 * (m - 1) / m
S[m - 1] = STemp[m - 1] + STemp[m - 2] * (m - 1) / m
S = S / np.sum(S)

The third layer of the prediction model, considering the single testing node, produces for each class the weighted sum of the probabilities obtained by both MLP and CCE for that class. The prediction is obtained by the argmax function. A compact sketch of the complete CCE and WS computation follows.
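For concreteness, here is a minimal NumPy sketch (our illustration) of the CCE steps listed above plus the WS combination for a single testing node, using the hyper-parameters of Table 3 as defaults (gaussianKernelWidth = 0.5, modelWeightMLP = 0.05, modelWeightCCE = 0.95); features_train, labels_train, feat_test, and p_mlp are assumed inputs.

import numpy as np

def cce_distribution(feat_test, features_train, labels_train, m, sigma=0.5):
    """CCE class distribution for one testing node (steps of Section 7)."""
    # Euclidean distances to all training nodes, normalized to [0, 1].
    d = np.linalg.norm(features_train - feat_test, axis=1)
    d = d / max(d.max(), 1e-12)
    # Gaussian kernel similarity (eq. 9), normalized to [0, 1].
    s = np.exp(-d**2 / (2 * sigma**2))
    s = s / s.max()
    # Per-class similarity sums, normalized to the overall sum.
    S = np.array([s[labels_train == c].sum() for c in range(m)])
    S = S / S.sum()
    # Redistribution over neighbouring (ordered) risk classes.
    STemp = np.copy(S)
    S[0] = STemp[0] + STemp[1] * (m - 1) / m
    for h in range(1, m - 1):
        S[h] = STemp[h] + (STemp[h - 1] + STemp[h + 1]) / 2 * (m - 1) / m
    S[m - 1] = STemp[m - 1] + STemp[m - 2] * (m - 1) / m
    return S / np.sum(S)

def gpa_predict(p_mlp, p_cce, w_mlp=0.05, w_cce=0.95):
    """WS layer: weighted sum of the two class distributions, then argmax."""
    return int(np.argmax(w_mlp * p_mlp + w_cce * p_cce))

With the five FINDRISC risk levels, m = 5, and p_mlp is the Softmax output of the MLP for the same testing node.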
7.1. Accuracy definition

We consider two accuracy definitions (eq. 7 and eq. 8). The second definition uses the similarities between labels, because the risk levels are orderable. In particular: np is the number of unlabeled nodes, $RL_i$ is the right label for node i, $PL_i(j)$ is the value of the probability distribution for unlabeled node i considering label j, and m is the number of possible labels (classes, risk levels).

$$accuracy = \frac{\sum_{i=1}^{np} if\left(\operatorname{argmax}_{j \in \{1,\ldots,m\}} PL_i(j) = RL_i,\ 1,\ 0\right)}{np} \qquad (7)$$

$$accuracy1 = \frac{\sum_{i=1}^{np} \left(1 - \frac{\left|\operatorname{argmax}_{j \in \{1,\ldots,m\}} PL_i(j) - RL_i\right|}{m-1}\right)}{np} \qquad (8)$$
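A minimal sketch of the two metrics (our illustration), where PL is an (np × m) array of predicted class distributions and RL the vector of right labels, both assumed:

import numpy as np

def accuracy(PL, RL):
    """Eq. (7): fraction of nodes whose argmax class is the right label."""
    return np.mean(np.argmax(PL, axis=1) == RL)

def accuracy1(PL, RL, m):
    """Eq. (8): credit decreases linearly with the distance between the
    predicted and the right risk level, since levels are ordered."""
    return np.mean(1 - np.abs(np.argmax(PL, axis=1) - RL) / (m - 1))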
7.2. Hyper-parameters

The hyper-parameters have been chosen with some preliminary tests. Initially, for the MLP component we considered GridSearchCV for a first analysis. We produced a random dataset of 1000 nodes, using MLPClassifier with 200 max iterations and the following parameter space: hidden_layer_sizes with (128,256,32), (256,512,32), (512,1024,32); learning_rate_init with 0.01, 0.1; validation_fraction with 0.1, 0.2; batch_size with 50, 100. For this analysis, we obtained these results as best: batch_size=100, hidden_layer_sizes=(128, 256, 32), learning_rate_init=0.01, validation_fraction=0.1. After other empirical tests, we chose the hyper-parameters described in Table 3.

Table 3
Hyper-parameters for prediction tests.

Component  Parameter            Value
input      numberOfAttributes   9
MLP        batchSizeMLP         100
MLP        decayMLP             1e-6
MLP        dropoutParameterMLP  0.25
MLP        epochsMLP            1000
MLP        learningRateMLP      0.01
MLP        unitsFirstDenseMLP   512
MLP        unitsSecondDenseMLP  1024
MLP        unitsThirdDenseMLP   16, 32 (extracted features)
MLP        unitsFourthDenseMLP  5 (prediction classes)
MLP        validationSplitMLP   0.2
CCE        gaussianKernelWidth  0.5
WS         modelWeightMLP       0.05
WS         modelWeightCCE       0.95

7.3. Prediction results

In Table 4, we present the accuracy results both for the MLP component alone and for the whole GPA model. In Figure 2, we outline the differences between the accuracy of GPA and MLP. As we can see, we have a slightly better performance with GPA. Moreover, the results are better than the results obtained with the algorithms tested with the WEKA tool. Of course, we must remember that we are reasoning with restricted random datasets, and so our conclusions are useful only from a testing point of view and not for formal healthcare deductions. In Table 5, we present the execution times for the prediction tests.

Table 4
Accuracy results for MLP and GPA.

Model  Labelled nodes  Extracted features  Accuracy  Accuracy1
MLP    1000            16                  0.8252    0.9559
MLP    1000            32                  0.8188    0.9543
MLP    1500            16                  0.8356    0.9587
MLP    1500            32                  0.8392    0.9597
MLP    2000            16                  0.8552    0.9638
MLP    2000            32                  0.8564    0.9641
GPA    1000            16                  0.8267    0.9562
GPA    1000            32                  0.8251    0.9558
GPA    1500            16                  0.8399    0.9597
GPA    1500            32                  0.8463    0.9614
GPA    2000            16                  0.8599    0.9649
GPA    2000            32                  0.8583    0.9645

Figure 2: GPA-MLP accuracy/accuracy1 vs number of extracted features (left: 16; right: 32).

Table 5
Execution times of the prediction model.

Labelled nodes  Extracted features  Execution time
1000            16                  00:03:28
1000            32                  00:03:31
1500            16                  00:04:50
1500            32                  00:04:53
2000            16                  00:06:14
2000            32                  00:06:21

8. Explainability model

Considering the DTD theory, we define a simplified rule to calculate the relevance of each single input parameter with respect to each single feature extracted by the first layer of the MLP. We consider the weights of the edges of the trained MLP; we do not consider the biases. Moreover, we manage the possible weight of the features for the Gaussian kernel distances in the CCE layer. This potential weight is calculated according to the labelled nodes, corresponding to the training data: we weight by the standard deviation of a feature. In fact, features with a high variation give a high contribution to the substantial distances, and so they have a significant contribution to the prediction classification (we could also further analyze the possibility of normalizing the training feature values). We first obtain a formula for explainability which does not depend on the particular input data and prediction. Then, we apply the formula to a single input data by multiplication, so as to calculate the percentage of relevance of that parameter in the considered prediction. Here, we use a forward definition for explainability, maintaining the conservation of the total relevance. Formally, if we have a normalized unlabeled node represented by the input data $(v_1, \ldots, v_n)$, with $v_i \in [0,1]$ for all $i \in \{1, \ldots, n\}$, we can establish the relevance of input i in the prediction as $R_{i,v}$, defined in equations (10), (11), (12), (13), where:

• F is the number of extracted features;
• $x_f^i$ is the value of feature f for labelled node i;
• N is the number of labelled nodes (training dataset);
• $C_{l+1}$ is the number of neurons of layer l+1 of the MLP (layer 0 is the input data);
• LF is the layer of the MLP for features extraction; for a particular f, this layer has only one neuron, corresponding to the particular extracted feature f;
• $w^+_{i_l j_{l+1}}$ is the absolute value of the weight of the MLP for the edge which connects neuron i of layer l with neuron j of layer l+1.

$$R_{i,v} = \frac{v_i \sum_{f=1}^{F} \left[ Rnorm^f_{i,0}\, \sigma_f \right]}{\sum_{j=1}^{n} v_j \sum_{f=1}^{F} \left[ Rnorm^f_{j,0}\, \sigma_f \right]} \qquad (10)$$

$$\sigma_f = \frac{\sqrt{\sum_{i=1}^{N} \left( x_f^i - \tfrac{\sum_{i=1}^{N} x_f^i}{N} \right)^2} - \min_{g \in \{1,\ldots,F\}} \sqrt{\sum_{i=1}^{N} \left( x_g^i - \tfrac{\sum_{i=1}^{N} x_g^i}{N} \right)^2}}{\max_{g \in \{1,\ldots,F\}} \sqrt{\sum_{i=1}^{N} \left( x_g^i - \tfrac{\sum_{i=1}^{N} x_g^i}{N} \right)^2} - \min_{g \in \{1,\ldots,F\}} \sqrt{\sum_{i=1}^{N} \left( x_g^i - \tfrac{\sum_{i=1}^{N} x_g^i}{N} \right)^2}} \qquad (11)$$

$$Rnorm^f_{i,l} = \sum_{j=1}^{C_{l+1}} \frac{Rnorm^f_{j,l+1}\, w^+_{i_l j_{l+1}}}{\sum_{k=1}^{C_l} w^+_{k_l j_{l+1}}} \qquad (12)$$

$$Rnorm^f_{1,LF} = 1 \qquad (13)$$

As we can see, $v_i$ is the only element related to the particular input data. All the other elements expressed in the formulas depend only on the fixed hyper-parameters and on the original training dataset. Moreover, in our tests, we calculate the constant components of $R_{i,v}$ only once, so as to optimize the test computation.
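The following is a minimal NumPy sketch (our illustration, under stated assumptions) of equations (10)-(13). Ws is assumed to be the list of trained MLP weight matrices from the input up to the feature-extraction layer LF (skipping batch normalization and dropout, which carry no edge weights here), and features_train is the (N × F) matrix of extracted features for the labelled nodes.

import numpy as np

def rnorm_input(Ws, f):
    """Eqs. (12)-(13): per-input relevance for extracted feature f,
    propagated from the single feature neuron through |w| weights."""
    R = np.zeros(Ws[-1].shape[1])
    R[f] = 1.0                                   # eq. (13)
    for W in reversed(Ws):
        Wp = np.abs(W)                           # w+ = |w|
        R = Wp @ (R / Wp.sum(axis=0))            # eq. (12)
    return R                                     # one value per input attribute

def sigma_weights(features_train):
    """Eq. (11): min-max normalized standard deviation of each feature
    (the 1/N factor inside the square root cancels in the normalization)."""
    s = features_train.std(axis=0)
    return (s - s.min()) / max(s.max() - s.min(), 1e-12)

def relevance(v, Ws, features_train):
    """Eq. (10): relevance of each input value v_i for the prediction."""
    sigma = sigma_weights(features_train)
    base = sum(sigma[f] * rnorm_input(Ws, f) for f in range(len(sigma)))
    r = np.asarray(v) * base                     # v_i enters only here
    return r / r.sum()

Since base does not depend on v, it can be computed once and reused for every testing node, which is the optimization described above.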
8.1. Hyper-parameters

Starting from the hyper-parameters used for the prediction tests, we simplified the MLP component, establishing new hyper-parameters for the explainability tests, where we repeated the training and prediction process too. We report the chosen configuration values in Table 6.

Table 6
Hyper-parameters for explainability tests.

Component       Parameter                   Value
input           numberOfAttributes          9
MLP             batchSizeMLP                100
MLP             decayMLP                    1e-6
MLP             dropoutParameterMLP         0.25
MLP             epochsMLP                   1000
MLP             learningRateMLP             0.01
MLP             unitsFirstDenseMLP          64
MLP             unitsSecondDenseMLP         128
MLP             unitsThirdDenseMLP          8 (extracted features)
MLP             unitsFourthDenseMLP         5 (prediction classes)
MLP             validationSplitMLP          0.2
CCE             gaussianKernelWidth         0.5
WS              modelWeightMLP              0.05
WS              modelWeightCCE              0.95
explainability  C                           [9,64,128,1]
explainability  LF                          3
explainability  MLPLevelsForExplainability  [0,4,8]

8.2. Explainability results

In Figure 3, we present the results of the explainability tests. In particular, we can see the average relevancies for all input data with respect to all testing predictions. For example, we can see particularly high relevancies for parameter 1 (age), parameter 2 (BMI), parameter 3 (waist circumference), and parameter 8 (family history), and a lower relevance for the gender parameter. Of course, we must remember that we are reasoning with restricted random datasets, and so our conclusions are useful only from a testing point of view and not for formal healthcare deductions. In Table 7, we present the execution times for the explainability tests.

Figure 3: Average relevancies vs number of training nodes (top left: 1000; top right: 1250; bottom left: 1500).

Table 7
Execution times of the explainability model.

Labelled nodes  Extracted features  Execution time
1000            8                   00:02:44
1250            8                   00:02:32
1500            8                   00:02:39

9. Conclusion

In this work, we have proposed an explainable model to predict diabetes risk. We have tested our model using randomly defined datasets produced according to a healthcare rule named FINDRISC. We chose to define random data to have the possibility to evaluate our model in a controlled manner (the input data are sufficiently distributed and the risk predictions are equally distributed) and to overcome any privacy problem. We defined our model, named GPA, using three layers. The first layer considers an MLP module and is used both to produce a first partial mixed prediction and to extract features for the second layer. This second layer, named CCE, produces a partial mixed prediction considering the similarity of a single unlabeled node with respect to all labelled nodes, managing class distances too; in this layer, a node is represented by the features extracted by the first layer. The third layer, named WS, considers the sum of the partial mixed predictions of the first and second layers (in a weighted manner) to obtain the final prediction by argmax. Experimentally, we noticed that accuracy improves using the whole GPA model with respect to using only the MLP layer. Moreover, we noticed that the accuracy results are better than the accuracy results produced using some algorithms of the WEKA tool. The main contribution of our research is the explainability of our model in terms of input parameters, useful for a medical doctor's (MD) understanding, also considering more predictions together. Generally, we must remember that for now our conclusions are useful only from a testing point of view and not for real deductions.
References

[1] FINDRISC (Finnish Diabetes Risk Score), https://www.mdcalc.com/findrisc-finnish-diabetes-risk-score, last accessed 2021/16/07.
[2] Weka 3: Machine Learning Software in Java, https://www.cs.waikato.ac.nz/ml/weka/index.html, last accessed 2021/16/07.
[3] Xiong, X., Zhang, R., Bi, Y., et al., Machine Learning Models in Type 2 Diabetes Risk Prediction: Results from a Cross-sectional Retrospective Study in Chinese Adults, Current Medical Science 39, 582–588, 2019, https://doi.org/10.1007/s11596-019-2077-4.
[4] Illari, S.I., Russo, S., Avanzato, R., Napoli, C., A cloud-oriented architecture for the remote assessment and follow-up of hospitalized patients, CEUR Workshop Proceedings, Vol. 2694, pp. 29–35, 2020.
[5] Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., Samek, W., On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation, PLOS ONE, 2015, https://doi.org/10.1371/journal.pone.0130140.
[6] Napoli, C., Pappalardo, G., Tramontana, E., A hybrid neuro-wavelet predictor for QoS control and stability, Lecture Notes in Computer Science, Vol. 8249 LNAI, pp. 527–538, 2013.
[7] Montavon, G., Lapuschkin, S., Binder, A., Samek, W., Müller, K.-R., Explaining nonlinear classification decisions with deep Taylor decomposition, Pattern Recognition, 2017.
[8] Kopitar, L., et al., Local vs. Global Interpretability of Machine Learning Models in Type 2 Diabetes Mellitus Screening, 2019, https://doi.org/10.1007/978-3-030-37446-4_9.
[9] Mazzenga, F., Giuliano, R., Vatalaro, F., FttC-based fronthaul for 5G dense/ultra-dense access network: Performance and costs in realistic scenarios, Future Internet 9(4), 71, 2017, https://doi.org/10.3390/fi9040071.
[10] Russo, S., Illari, S.I., Avanzato, R., Napoli, C., Reducing the psychological burden of isolated oncological patients by means of decision trees, CEUR Workshop Proceedings, 2768, pp. 46–53, 2020.
[11] Giuliano, R., Mazzenga, F., Vizzarri, A., Satellite-Based Capillary 5G-mMTC Networks for Environmental Applications, IEEE Aerospace and Electronic Systems Magazine, 34(10), pp. 40–48, 2019, https://doi.org/10.1109/MAES.2019.2923295.
[12] Capizzi, G., Lo Sciuto, G., Napoli, C., Tramontana, E., A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines, 7(7), art. no. 110, 2016.
[13] Capizzi, G., Lo Sciuto, G., Napoli, C., Tramontana, E., Woźniak, M., A novel neural networks-based texture image processing algorithm for orange defects classification, International Journal of Computer Science and Applications, 13(2), pp. 45–60, 2016.
[14] Napoli, C., Bonanno, F., Capizzi, G., Exploiting solar wind time series correlation with magnetospheric response by using an hybrid neuro-wavelet approach, Proceedings of the International Astronomical Union, 6(S274), pp. 156–158, 2010, DOI: 10.1017/S1743921311006806.
[15] Khanam, J.J., Foo, S.Y., A comparison of machine learning algorithms for diabetes prediction, ICT Express, 2021, ISSN 2405-9595.
[16] Tigga, N.P., Garg, S., Prediction of Type 2 Diabetes using Machine Learning Classification Methods, Procedia Computer Science, Vol. 167, pp. 706–716, 2020, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2020.03.336.
[17] Hasan, M.K., et al., Diabetes Prediction Using Ensembling of Different Machine Learning Classifiers, 2020, doi: 10.1109/ACCESS.2020.2989857.
[18] Montavon, G., et al., Layer-Wise Relevance Propagation: An Overview, in: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, https://doi.org/10.1007/978-3-030-28954-6_10.
[19] Montavon, G., Bach, S., Binder, A., Samek, W., Müller, K.-R., Deep Taylor Decomposition of Neural Networks, Proceedings of the ICML'16 Workshop on Visualization for Deep Learning, 2016.
[20] Kauffmann, J., et al., Towards explaining anomalies: A deep Taylor decomposition of one-class models, Pattern Recognition, https://doi.org/10.1016/j.patcog.2020.107198.