An Explainable Model for Diabetes Risk Prediction

Alessandro Cabroni¹, Francesca Fallucchi¹

¹ Guglielmo Marconi University, Via Plinio 44, 00193 Rome, Italy

Abstract
In Artificial Intelligence, one of the most important issues concerns the necessity to understand why a particular prediction is chosen by a model from the considered input data. In this work, we propose a model, named Global Prediction Architecture, based on three layers (Multilayer Perceptron, Closest Classes and Elements, and a third layer that combines them), where the first layer produces both a partial prediction and the extracted features used by the second layer. We are interested in analyzing the behavior of the model both for accuracy and for explainability in terms of input data. We apply our study in the healthcare context of diabetes. Diabetes (diabetes mellitus) is a disease present when a person has a high blood sugar level for a long period. One important issue is the possibility of preventing the disease. We analyze the possibility of determining the diabetes risk with respect to daily lifestyle and health parameters, such as Body Mass Index, age, waist circumference, use of blood pressure medication, history of high blood glucose, physical activity, consumption of vegetables/fruits/berries, and family history of diabetes. We produce datasets randomly generated according to the rule named Finnish Diabetes Risk Score. This work aims to produce random and anonymized diabetes risk datasets, to test a model in terms of improving accuracy for the prediction of diabetes risk, and, most of all, to propose and test a method for explainability in the context of diabetes prediction, using an approach initially derived from Layer-Wise Relevance Propagation and Deep Taylor Decomposition.

Keywords
Diabetes risk prediction, FINnish Diabetes RIsk SCore, Multilayer Perceptron, Explainability, Layer-Wise Relevance Propagation, Deep Taylor Decomposition

1. Introduction

In healthcare, one of the topics of interest is disease prevention. In this work, we consider the problem of identifying the risk of type 2 diabetes for a person. We are interested in three principal issues: the production of testing datasets, the definition of a model to improve prediction accuracy, and the definition of an explainability method adequate to the prediction model. About the first issue, we use datasets randomly generated according to the rule FINnish Diabetes RIsk SCore (FINDRISC) [1]. Using random datasets, we have the possibility to establish controlled data useful to compare different models, without any privacy problem. About the second issue, we consider a new model based on three layers, first of all a Multilayer Perceptron (MLP).
This layer produces both a prediction and the extracted features. The extracted features are used by a second layer based on comparing one unlabeled node (the testing node) with all labelled nodes (the training nodes) in terms of similarities, considering class (diabetes risk level) similarity too. The third layer puts together the predictions of the first two layers in a weighted manner. We name the overall model Global Prediction Architecture (GPA). We obtain an accuracy better than using some algorithms of the Waikato Environment for Knowledge Analysis (WEKA) tool [2], and an accuracy slightly better than using only the MLP component. We establish the diabetes risk according to daily lifestyle and health parameters, such as Body Mass Index (BMI), age, waist circumference, use of blood pressure medication, history of high blood glucose, physical activity, consumption of vegetables/fruits/berries, and family history of diabetes. There are other works about this issue for diabetes (e.g., [3]). About the third issue, we propose an explainability solution based on reasoning about the relevance of the input data with respect to the prediction. In particular, we combine a new solution conceptually derived from Layer-Wise Relevance Propagation (LRP) and Deep Taylor Decomposition (DTD) (e.g., [5], [7]) with the distribution of the extracted features over the training data, to capture the relevance of the features in the second layer. Hence, from the explainability point of view, we have a theoretical model for the first (and implicitly the third) layer, and a model based on the data distribution (we consider the standard deviation of each single feature with respect to the training data) for the second (and implicitly the third) layer. We could add our solution to other studies in a similar context (e.g., [8]). We could also explore the use of the solution in an Internet of Things (IoT) context, also considering the possibilities of 5G networks (about this last subject, see e.g. [9], [11]), and for different architectures in other domains (e.g., see [12], [13, 14]).

The following sections are organized as follows. The Related work section reports some works about prediction on diabetes and a summary of the major concepts behind LRP and DTD. The Methodology section reports the steps followed in the research for this work. The Tools and environments section reports the principal tools and environments used to implement and test our solution. The Dataset definition section outlines the rule used to implement the randomly generated datasets, with a visual distribution in terms of mean and standard deviation over all input attributes. The Dataset analysis section reports the results of an analysis conducted on the largest training dataset; it presents the results of a prediction test using some algorithms available in the WEKA tool. The Prediction model section describes the model defined and tested in this work in the context of diabetes risk prediction; in this section, we also present the accuracy definitions used, the values of the hyper-parameters which instantiate the model, and the prediction results. The Explainability model section presents our solution to explain the behavior of the prediction model in terms of input data; this section also reports the hyper-parameters used in the tests finalized to explainability and the results in terms of input data relevance for the prediction. In the Conclusion section, we briefly summarize the obtained results about dataset creation, prediction accuracy, and explainability.

2. Related work

In this section, we first briefly cite three chosen works about prediction in the context of diabetes. Then, we present some concepts about explainability, in particular for LRP and DTD.

In [15], they use the Pima Indian Diabetes (PID) dataset and test seven Machine Learning (ML) algorithms for predictions related to diabetes, also using the WEKA tool. They obtain the best results by using Logistic Regression (LR) and Support Vector Machine (SVM) for diabetes prediction. They also implemented a Neural Network (NN) with two hidden layers for the accuracy. In [16], they evaluate the risk of diabetes based on lifestyle and family background. They consider 952 instances produced by a questionnaire related to health, lifestyle and family background. They applied different ML algorithms both to this dataset and to the PID dataset. The most accurate performance is obtained by the Random Forest (RF) classifier. Also in [17], they trained the ML models using the PID dataset. They propose a framework based on pre-processing, K-fold Cross-Validation (KCV), and grid search for hyper-parameters, to select the best model among different algorithms. In future work they are interested in applying their results in other medical contexts to verify the general usefulness.

In the general context of explainability, as a basis for our study we are interested in LRP and DTD. In this section, we review some of the concepts described in [5], [7], [18], [19], and [20]. In LRP, the prediction is propagated backwards through the NN. Each propagation step redistributes the relevance to the lower level of the NN by a conservation rule:

$$R_j = \sum_k \frac{z_{jk}}{\sum_{j'} z_{j'k}} R_k \qquad (1)$$

where $z_{jk}$ corresponds to how much neuron j contributes to the relevance of neuron k. The recursive propagation finishes at the input data. One single step can be defined as a Taylor decomposition. In our context, we consider an MLP as an acyclic graph based on the Rectified Linear Unit (ReLU) activation function at each layer, with input data not less than zero. Supposing we have a neuron N receiving the input vector $x_{input} = (x_1, \ldots, x_n)$ and producing the scalar $y_{output}$, we have:

$$y_{output} = \max\left(0, \sum_{i=1}^{n} x_i w_{ij} + b_j\right) \qquad (2)$$

with $b_j \le 0$. Considering DTD, we have that LRP corresponds to a succession of Taylor expansions local to each neuron. We now consider that the output can be described as a first-order Taylor expansion. Defining $[y_{output}]_i$ as the redistribution of $y_{output}$ on neuron i of the lower layer, we have the rule of redistribution ($z^+$-rule) when the lower level of N is a ReLU layer:

$$[y_{output}]_i = \frac{x_i w_{ij}^+}{\sum_{k=1}^{n} x_k w_{kj}^+}\, y_{output} \qquad (3)$$

where n equals the number of neurons in the lower level of N, and $v^+ = |v|$. Defining $x_f$ as the final output of the NN for a particular input data, we have that $[[x_f]_j]_i$ corresponds to the quantity of $x_f$ distributed from one node j to one node i, where i is an input node for node j, and $[x_f]_i$ corresponds to the quantity of $x_f$ distributed on node i:

$$[x_f]_i = \sum_{j=1}^{n_1} [[x_f]_j]_i \qquad (4)$$

where $n_1$ equals the number of nodes in the higher level for node i. Considering $[x_f]_j = x_j c_j$ (neuron activation times a constant value), we have:

$$[x_f]_i = \sum_{j=1}^{n_1} [[x_f]_j]_i = \sum_{j=1}^{n_1} [x_j c_j]_i = \sum_{j=1}^{n_1} [x_j]_i\, c_j = x_i c_i \qquad (5)$$

Moreover,

$$c_i = \sum_{j=1}^{n_1} \frac{w_{ij}^+\, [x_f]_j}{\sum_{i_1=1}^{n} x_{i_1} w_{i_1 j}^+} \qquad (6)$$

where $n_1$ equals the number of nodes in the higher level of node i, and n is the number of neurons in the lower level of j (the same level as i). At the beginning, we have $[x_f]_f = x_f c_f$ and $c_f = 1$. By induction, there is a product structure with a backward propagation rule and the conservation of the output (redistribution on the input nodes). A small numerical sketch of this propagation is given below.
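To make the backward redistribution concrete, the following is a minimal NumPy sketch (our illustration, not code from [5] or [7]) that applies the $z^+$-rule of eq. (3) layer by layer in a small ReLU network; the weight matrices Ws and the input x are toy values, and, following the paper's convention $v^+ = |v|$, absolute weights are used.

import numpy as np

def zplus_relevance(x, Ws):
    """Backward relevance propagation with the z+-rule (eq. 3).
    x  : non-negative input vector.
    Ws : list of weight matrices; Ws[l] has shape (units_l, units_l+1).
    """
    # Forward pass, storing the ReLU activations of every layer.
    activations = [np.asarray(x, dtype=float)]
    for W in Ws:
        activations.append(np.maximum(0.0, activations[-1] @ W))

    # All relevance initially sits on the final-layer activations.
    R = activations[-1].copy()

    # Backward pass: redistribute relevance proportionally to x_i * w_ij^+.
    for W, a in zip(reversed(Ws), reversed(activations[:-1])):
        Wp = np.abs(W)                 # v+ = |v|, as in the paper
        z = a @ Wp                     # denominators of eq. (3)
        z[z == 0] = 1e-12              # avoid division by zero
        R = a * (Wp @ (R / z))         # eq. (3), summed over upper neurons
    return R

# Toy usage: two-layer ReLU net with random non-negative inputs.
rng = np.random.default_rng(0)
x = rng.random(4)
Ws = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]
R = zplus_relevance(x, Ws)
print(R, R.sum())  # the sum equals the total relevance at the output

Because each step only splits every $R_k$ proportionally among the lower neurons, the total relevance is conserved down to the inputs, which is the property the paper's forward explainability rule also maintains.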
3. Methodology

In this research, we first analyzed some papers about diabetes risk prediction and about explainability via LRP and DTD. We selected a rule (FINDRISC) to produce random datasets. We defined our prediction model, named GPA, with three layers: MLP for partial prediction and features extraction, Closest Classes and Elements (CCE) for partial prediction, and Weighted Sum (WS) for the final prediction. CCE evaluates the similarity between one unlabeled node and all labelled nodes, using the extracted features. WS sums the two partial predictions, adequately weighted, to define the final prediction by argmax. We analyzed the best hyper-parameters, also using GridSearchCV of the scikit-learn tool. We analyzed the test predictions, comparing GPA accuracies against MLP accuracies and against the accuracies of some WEKA algorithms. We defined the explainability solution based on a forward DTD-derived component for the first (and implicitly the third) layer, and on the weight (standard deviation) of the extracted features (computed on the training data) for the second (and implicitly the third) layer. We tested the explainability using a simplified MLP.

4. Tools and environments

Table 1 reports the used environments and tools, distinguishing between the first set (datasets definition, prediction, explainability) and the second set (only dataset analysis).

Table 1
Used environments and tools.

Set: Datasets definition, Prediction, Explainability
Tool and environment: Colaboratory (Colab), backend Google Compute Engine, Python™ 3; RAM: 0.75GB out of 12.69GB; available disk space: 38.47GB out of 107.72GB; Tensor Processing Unit (TPU)

Set: Dataset analysis
Tool and environment: Weka 3.8.5, Windows 8.1 (64 bit), Intel® Celeron® CPU 1007U 1.50GHz, RAM 4GB (3.88GB usable)
5. Dataset definition

We generate eight datasets according to FINDRISC [1]. Four datasets are for the prediction experiments (2500 elements for testing; 1000, 1500, and 2000 for training), and the other four are for the explainability experiments (1750 elements for testing; 1000, 1250, and 1500 for training). The rule identifies at-risk individuals without laboratory tests. It considers five risk levels with respect to the score: very low (0-3), low (4-8), moderate (9-12), high (13-20) and very high (21-26). All datasets are equally balanced with respect to the possible scores. These are the attributes to be considered: BMI (weight (kg) / height squared (m²)), age (years), waist circumference (differentiating by gender), use of blood pressure medication, history of high blood glucose, physical activity expressed in hours/week, daily consumption of vegetables, fruits or berries, and family history of diabetes. The score is calculated according to the rule. The random input data are normalized to [0, 1]. These are the input data of our prediction model, while the risk score is the right prediction. In Figure 1, we can see the distribution (mean and standard deviation) of the generated datasets; a sketch of how such a dataset can be generated is shown below.

Figure 1: Dataset distribution for input data (from left to right: prediction and explainability experiments).
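As an illustration of this kind of generator, the following is a minimal sketch (our reconstruction, not the authors' code). The point assignments follow the published FINDRISC rule [1]; the attribute sampling ranges, the boolean encoding of physical activity, and the helper names are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(42)

def findrisc_score(age, bmi, waist, male, activity_ok, veggies_daily,
                   bp_meds, high_glucose, family_history):
    """FINDRISC points per the published rule [1]; family_history is
    0 = none, 1 = second-degree relatives, 2 = first-degree relatives."""
    score = 0
    score += 0 if age < 45 else 2 if age < 55 else 3 if age < 65 else 4
    score += 0 if bmi < 25 else 1 if bmi <= 30 else 3
    limits = (94, 102) if male else (80, 88)     # waist bands by gender (cm)
    score += 0 if waist < limits[0] else 3 if waist <= limits[1] else 4
    score += 0 if activity_ok else 2             # >= 30 min/day of activity
    score += 0 if veggies_daily else 1           # daily vegetables/fruits/berries
    score += 2 if bp_meds else 0                 # blood pressure medication
    score += 5 if high_glucose else 0            # history of high blood glucose
    score += (0, 3, 5)[family_history]
    return score                                 # in [0, 26]

def risk_level(score):
    """Five levels used in this work: 0-3, 4-8, 9-12, 13-20, 21-26."""
    return np.digitize(score, [4, 9, 13, 21])    # 0 = very low ... 4 = very high

def random_instance():
    return dict(age=rng.uniform(18, 90), bmi=rng.uniform(16, 45),
                waist=rng.uniform(60, 140), male=rng.random() < 0.5,
                activity_ok=rng.random() < 0.5, veggies_daily=rng.random() < 0.5,
                bp_meds=rng.random() < 0.5, high_glucose=rng.random() < 0.5,
                family_history=int(rng.integers(0, 3)))

sample = random_instance()
print(sample, findrisc_score(**sample), risk_level(findrisc_score(**sample)))

The paper additionally normalizes each attribute to [0, 1] before feeding it to the model and balances the datasets over the five risk levels; both steps are omitted here for brevity.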
6. Dataset analysis

For a preliminary analysis of the produced datasets, we considered the dataset with 2000 elements and we analyzed in detail both its data distribution and the accuracy results, using the WEKA tool [2]. In Table 2 we can see the accuracy results for the considered algorithms: J48, KStar, MLP, Naïve Bayes (NB), and RandomTree (RT). We used 10-fold cross-validation for the analysis.

Table 2
WEKA prediction tests using the dataset with 2000 nodes (10-fold cross-validation).

Model    Accuracy
J48      0.826
KStar    0.71
MLP      0.7965
NB       0.713
RT       0.7595

7. Prediction model

Our prediction model has three layers: the first layer is the MLP, the second layer is CCE (it uses the features extracted from the MLP), and the third layer combines the predictions of both MLP and CCE. The MLP has the following elements: dense, batch normalization, ReLU activation, dropout; dense, batch normalization, ReLU activation, dropout; dense, batch normalization, ReLU activation, dropout; dense, Softmax activation. CCE uses the features extracted from the third dense layer; for each testing node we implement the following algorithm:

• Calculate the Euclidean distance between the considered testing node and all training nodes.
• Normalize these distances to [0, 1].
• Using the normalized distances, calculate the Gaussian kernel similarity between the considered testing node and all training nodes:

$$sim_{i,j} = e^{-\frac{d(i,j)^2}{2\sigma^2}} \qquad (9)$$

• Normalize these similarities to [0, 1].
• For each possible label (risk class), calculate the sum of the similarities.
• Normalize all the sums to the overall sum.
• Recalculate the sum distribution, considering the similarity between labels too, according to the following algorithm (m is the number of labels/risk classes and S is the vector of normalized per-class sums):

import numpy as np

STemp = np.copy(S)
# Each class also receives a contribution from its neighbouring
# risk levels, because the risk levels are ordered.
S[0] = STemp[0] + STemp[1] * (m - 1) / m
for h in range(1, m - 1):
    S[h] = STemp[h] + (STemp[h - 1] + STemp[h + 1]) / 2 * (m - 1) / m
S[m - 1] = STemp[m - 1] + STemp[m - 2] * (m - 1) / m
S = S / np.sum(S)

The third layer of the prediction model, considering the single testing node, produces for each class the weighted sum of the probabilities obtained by both MLP and CCE for that class. The prediction is obtained by the argmax function. A compact sketch of the complete CCE and WS computation follows.
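For concreteness, here is a minimal NumPy sketch (our illustration) of the CCE steps listed above plus the WS combination for a single testing node, using the hyper-parameters of Table 3 as defaults (gaussianKernelWidth = 0.5, modelWeightMLP = 0.05, modelWeightCCE = 0.95); features_train, labels_train, feat_test, and p_mlp are assumed inputs.

import numpy as np

def cce_distribution(feat_test, features_train, labels_train, m, sigma=0.5):
    """CCE class distribution for one testing node (steps of Section 7)."""
    # Euclidean distances to all training nodes, normalized to [0, 1].
    d = np.linalg.norm(features_train - feat_test, axis=1)
    d = d / max(d.max(), 1e-12)
    # Gaussian kernel similarity (eq. 9), normalized to [0, 1].
    s = np.exp(-d**2 / (2 * sigma**2))
    s = s / s.max()
    # Per-class similarity sums, normalized to the overall sum.
    S = np.array([s[labels_train == c].sum() for c in range(m)])
    S = S / S.sum()
    # Redistribution over neighbouring (ordered) risk classes.
    STemp = np.copy(S)
    S[0] = STemp[0] + STemp[1] * (m - 1) / m
    for h in range(1, m - 1):
        S[h] = STemp[h] + (STemp[h - 1] + STemp[h + 1]) / 2 * (m - 1) / m
    S[m - 1] = STemp[m - 1] + STemp[m - 2] * (m - 1) / m
    return S / np.sum(S)

def gpa_predict(p_mlp, p_cce, w_mlp=0.05, w_cce=0.95):
    """WS layer: weighted sum of the two class distributions, then argmax."""
    return int(np.argmax(w_mlp * p_mlp + w_cce * p_cce))

With the five FINDRISC risk levels, m = 5, and p_mlp is the Softmax output of the MLP for the same testing node.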
7.1. Accuracy definition

We consider two accuracy definitions (eq. 7 and eq. 8). The second definition uses the similarities between labels, because the risk levels are orderable. In particular: np is the number of unlabeled nodes, $RL_i$ is the right label for node i, $PL_i(j)$ is the value of the probability distribution for unlabeled node i considering label j, and m is the number of possible labels (classes, risk levels).

$$accuracy = \frac{\sum_{i=1}^{np} if\left(\operatorname{argmax}_{j \in \{1,\ldots,m\}} PL_i(j) = RL_i,\ 1,\ 0\right)}{np} \qquad (7)$$

$$accuracy1 = \frac{\sum_{i=1}^{np} \left(1 - \frac{\left|\operatorname{argmax}_{j \in \{1,\ldots,m\}} PL_i(j) - RL_i\right|}{m-1}\right)}{np} \qquad (8)$$
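A minimal sketch of the two metrics (our illustration), where PL is an (np × m) array of predicted class distributions and RL the vector of right labels, both assumed:

import numpy as np

def accuracy(PL, RL):
    """Eq. (7): fraction of nodes whose argmax class is the right label."""
    return np.mean(np.argmax(PL, axis=1) == RL)

def accuracy1(PL, RL, m):
    """Eq. (8): credit decreases linearly with the distance between the
    predicted and the right risk level, since levels are ordered."""
    return np.mean(1 - np.abs(np.argmax(PL, axis=1) - RL) / (m - 1))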
7.2. Hyper-parameters

The hyper-parameters have been chosen with some preliminary tests. Initially, for the MLP component we considered GridSearchCV for a first analysis. We produced a random dataset of 1000 nodes, using MLPClassifier with 200 max iterations and the following parameter space: hidden_layer_sizes with (128,256,32), (256,512,32), (512,1024,32); learning_rate_init with 0.01, 0.1; validation_fraction with 0.1, 0.2; batch_size with 50, 100. For this analysis, we obtained these results as best: batch_size=100, hidden_layer_sizes=(128, 256, 32), learning_rate_init=0.01, validation_fraction=0.1. After other empirical tests, we chose the hyper-parameters described in Table 3.

Table 3
Hyper-parameters for prediction tests.

Component  Parameter            Value
input      numberOfAttributes   9
MLP        batchSizeMLP         100
MLP        decayMLP             1e-6
MLP        dropoutParameterMLP  0.25
MLP        epochsMLP            1000
MLP        learningRateMLP      0.01
MLP        unitsFirstDenseMLP   512
MLP        unitsSecondDenseMLP  1024
MLP        unitsThirdDenseMLP   16, 32 (extracted features)
MLP        unitsFourthDenseMLP  5 (prediction classes)
MLP        validationSplitMLP   0.2
CCE        gaussianKernelWidth  0.5
WS         modelWeightMLP       0.05
WS         modelWeightCCE       0.95

7.3. Prediction results

In Table 4, we present the accuracy results both for the MLP component alone and for the whole GPA model. In Figure 2, we outline the differences between the accuracy of GPA and MLP. As we can see, we have a slightly better performance with GPA. Moreover, the results are better than the results obtained with the algorithms tested with the WEKA tool. Of course, we must remember that we are reasoning with restricted random datasets, and so our conclusions are useful only from a testing point of view and not for formal healthcare deductions. In Table 5, we present the execution times for the prediction tests.

Table 4
Accuracy results for MLP and GPA.

Model  Labelled nodes  Extracted features  Accuracy  Accuracy1
MLP    1000            16                  0.8252    0.9559
MLP    1000            32                  0.8188    0.9543
MLP    1500            16                  0.8356    0.9587
MLP    1500            32                  0.8392    0.9597
MLP    2000            16                  0.8552    0.9638
MLP    2000            32                  0.8564    0.9641
GPA    1000            16                  0.8267    0.9562
GPA    1000            32                  0.8251    0.9558
GPA    1500            16                  0.8399    0.9597
GPA    1500            32                  0.8463    0.9614
GPA    2000            16                  0.8599    0.9649
GPA    2000            32                  0.8583    0.9645

Figure 2: GPA-MLP accuracy/accuracy1 vs number of extracted features (left: 16; right: 32).

Table 5
Execution times of the prediction model.

Labelled nodes  Extracted features  Execution time
1000            16                  00:03:28
1000            32                  00:03:31
1500            16                  00:04:50
1500            32                  00:04:53
2000            16                  00:06:14
2000            32                  00:06:21

8. Explainability model

Considering the DTD theory, we define a simplified rule to calculate the relevance of each single input parameter with respect to each single feature extracted by the first layer of the MLP. We consider the weights of the edges of the trained MLP; we do not consider the biases. Moreover, we manage the possible weight of the features for the Gaussian kernel distances in the CCE layer. This potential weight is calculated according to the labelled nodes, corresponding to the training data: we weight by the standard deviation of a feature. In fact, features with a high variation give a high contribution to the substantial distances, and so they have a significant contribution to the prediction classification (we could also further analyze the possibility of normalizing the training feature values). We first obtain a formula for explainability which does not depend on the particular input data and prediction. Then, we apply the formula to a single input data by multiplication, so as to calculate the percentage of relevance of that parameter in the considered prediction. Here, we use a forward definition for explainability, maintaining the conservation of the total relevance. Formally, if we have a normalized unlabeled node represented by the input data $(v_1, \ldots, v_n)$, with $v_i \in [0,1]$ for all $i \in \{1, \ldots, n\}$, we can establish the relevance of input i in the prediction as $R_{i,v}$, defined in equations (10), (11), (12), (13), where:

• F is the number of extracted features;
• $x_f^i$ is the value of feature f for labelled node i;
• N is the number of labelled nodes (training dataset);
• $C_{l+1}$ is the number of neurons of layer l+1 of the MLP (layer 0 is the input data);
• LF is the layer of the MLP for features extraction; for a particular f, this layer has only one neuron, corresponding to the particular extracted feature f;
• $w^+_{i_l j_{l+1}}$ is the absolute value of the weight of the MLP for the edge which connects neuron i of layer l with neuron j of layer l+1.

$$R_{i,v} = \frac{v_i \sum_{f=1}^{F} \left[ Rnorm^f_{i,0}\, \sigma_f \right]}{\sum_{j=1}^{n} v_j \sum_{f=1}^{F} \left[ Rnorm^f_{j,0}\, \sigma_f \right]} \qquad (10)$$

$$\sigma_f = \frac{\sqrt{\sum_{i=1}^{N} \left( x_f^i - \tfrac{\sum_{i=1}^{N} x_f^i}{N} \right)^2} - \min_{g \in \{1,\ldots,F\}} \sqrt{\sum_{i=1}^{N} \left( x_g^i - \tfrac{\sum_{i=1}^{N} x_g^i}{N} \right)^2}}{\max_{g \in \{1,\ldots,F\}} \sqrt{\sum_{i=1}^{N} \left( x_g^i - \tfrac{\sum_{i=1}^{N} x_g^i}{N} \right)^2} - \min_{g \in \{1,\ldots,F\}} \sqrt{\sum_{i=1}^{N} \left( x_g^i - \tfrac{\sum_{i=1}^{N} x_g^i}{N} \right)^2}} \qquad (11)$$

$$Rnorm^f_{i,l} = \sum_{j=1}^{C_{l+1}} \frac{Rnorm^f_{j,l+1}\, w^+_{i_l j_{l+1}}}{\sum_{k=1}^{C_l} w^+_{k_l j_{l+1}}} \qquad (12)$$

$$Rnorm^f_{1,LF} = 1 \qquad (13)$$

As we can see, $v_i$ is the only element related to the particular input data. All the other elements expressed in the formulas depend only on the fixed hyper-parameters and on the original training dataset. Moreover, in our tests, we calculate the constant components of $R_{i,v}$ only once, so as to optimize the test computation.
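The following is a minimal NumPy sketch (our illustration, under stated assumptions) of equations (10)-(13). Ws is assumed to be the list of trained MLP weight matrices from the input up to the feature-extraction layer LF (skipping batch normalization and dropout, which carry no edge weights here), and features_train is the (N × F) matrix of extracted features for the labelled nodes.

import numpy as np

def rnorm_input(Ws, f):
    """Eqs. (12)-(13): per-input relevance for extracted feature f,
    propagated from the single feature neuron through |w| weights."""
    R = np.zeros(Ws[-1].shape[1])
    R[f] = 1.0                                   # eq. (13)
    for W in reversed(Ws):
        Wp = np.abs(W)                           # w+ = |w|
        R = Wp @ (R / Wp.sum(axis=0))            # eq. (12)
    return R                                     # one value per input attribute

def sigma_weights(features_train):
    """Eq. (11): min-max normalized standard deviation of each feature
    (the 1/N factor inside the square root cancels in the normalization)."""
    s = features_train.std(axis=0)
    return (s - s.min()) / max(s.max() - s.min(), 1e-12)

def relevance(v, Ws, features_train):
    """Eq. (10): relevance of each input value v_i for the prediction."""
    sigma = sigma_weights(features_train)
    base = sum(sigma[f] * rnorm_input(Ws, f) for f in range(len(sigma)))
    r = np.asarray(v) * base                     # v_i enters only here
    return r / r.sum()

Since base does not depend on v, it can be computed once and reused for every testing node, which is the optimization described above.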
8.1. Hyper-parameters

Starting from the hyper-parameters used for the prediction tests, we simplified the MLP component, establishing new hyper-parameters for the explainability tests, where we repeated the training and prediction process too. We report the chosen configuration values in Table 6.

Table 6
Hyper-parameters for explainability tests.

Component       Parameter                   Value
input           numberOfAttributes          9
MLP             batchSizeMLP                100
MLP             decayMLP                    1e-6
MLP             dropoutParameterMLP         0.25
MLP             epochsMLP                   1000
MLP             learningRateMLP             0.01
MLP             unitsFirstDenseMLP          64
MLP             unitsSecondDenseMLP         128
MLP             unitsThirdDenseMLP          8 (extracted features)
MLP             unitsFourthDenseMLP         5 (prediction classes)
MLP             validationSplitMLP          0.2
CCE             gaussianKernelWidth         0.5
WS              modelWeightMLP              0.05
WS              modelWeightCCE              0.95
explainability  C                           [9,64,128,1]
explainability  LF                          3
explainability  MLPLevelsForExplainability  [0,4,8]

8.2. Explainability results

In Figure 3, we present the results of the explainability tests. In particular, we can see the average relevancies for all input data with respect to all testing predictions. For example, we can see particularly high relevancies for parameter 1 (age), parameter 2 (BMI), parameter 3 (waist circumference), and parameter 8 (family history), and a lower relevance for the gender parameter. Of course, we must remember that we are reasoning with restricted random datasets, and so our conclusions are useful only from a testing point of view and not for formal healthcare deductions. In Table 7, we present the execution times for the explainability tests.

Figure 3: Average relevancies vs number of training nodes (top left: 1000; top right: 1250; bottom left: 1500).

Table 7
Execution times of the explainability model.

Labelled nodes  Extracted features  Execution time
1000            8                   00:02:44
1250            8                   00:02:32
1500            8                   00:02:39

9. Conclusion

In this work, we have proposed an explainable model to predict diabetes risk. We have tested our model using randomly defined datasets produced according to a healthcare rule named FINDRISC. We chose to define random data to have the possibility to evaluate our model in a controlled manner (the input data are sufficiently distributed and the risk predictions are equally distributed) and to overcome any privacy problem. We defined our model, named GPA, using three layers. The first layer considers an MLP module and is used both to produce a first partial mixed prediction and to extract features for the second layer. This second layer, named CCE, produces a partial mixed prediction considering the similarity of a single unlabeled node with respect to all labelled nodes, managing class distances too; in this layer, a node is represented by the features extracted by the first layer. The third layer, named WS, considers the sum of the partial mixed predictions of the first and second layers (in a weighted manner) to obtain the final prediction by argmax. Experimentally, we noticed that accuracy improves using the whole GPA model with respect to using only the MLP layer. Moreover, we noticed that the accuracy results are better than the accuracy results produced using some algorithms of the WEKA tool. The main contribution of our research is the explainability of our model in terms of input parameters, useful for a medical doctor's (MD) understanding, also considering more predictions together. Generally, we must remember that for now our conclusions are useful only from a testing point of view and not for real deductions.
References

[1] FINDRISC (Finnish Diabetes Risk Score), https://www.mdcalc.com/findrisc-finnish-diabetes-risk-score, last accessed 2021/16/07.
[2] Weka 3: Machine Learning Software in Java, https://www.cs.waikato.ac.nz/ml/weka/index.html, last accessed 2021/16/07.
[3] Xiong, X., Zhang, R., Bi, Y., et al., Machine Learning Models in Type 2 Diabetes Risk Prediction: Results from a Cross-sectional Retrospective Study in Chinese Adults, Current Medical Science 39, 582–588, 2019, https://doi.org/10.1007/s11596-019-2077-4.
[4] Illari, S.I., Russo, S., Avanzato, R., Napoli, C., A cloud-oriented architecture for the remote assessment and follow-up of hospitalized patients, CEUR Workshop Proceedings, Vol. 2694, pp. 29–35, 2020.
[5] Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., Samek, W., On Pixel-Wise Explanations for Non-Linear Classifier Decisions by Layer-Wise Relevance Propagation, PLOS ONE, 2015, https://doi.org/10.1371/journal.pone.0130140.
[6] Napoli, C., Pappalardo, G., Tramontana, E., A hybrid neuro-wavelet predictor for QoS control and stability, Lecture Notes in Computer Science, Vol. 8249 LNAI, pp. 527–538, 2013.
[7] Montavon, G., Lapuschkin, S., Binder, A., Samek, W., Müller, K.-R., Explaining nonlinear classification decisions with deep Taylor decomposition, Pattern Recognition, 2017.
[8] Kopitar, L., et al., Local vs. Global Interpretability of Machine Learning Models in Type 2 Diabetes Mellitus Screening, 2019, https://doi.org/10.1007/978-3-030-37446-4_9.
[9] Mazzenga, F., Giuliano, R., Vatalaro, F., FttC-based fronthaul for 5G dense/ultra-dense access network: Performance and costs in realistic scenarios, Future Internet 9(4), 71, 2017, https://doi.org/10.3390/fi9040071.
[10] Russo, S., Illari, S.I., Avanzato, R., Napoli, C., Reducing the psychological burden of isolated oncological patients by means of decision trees, CEUR Workshop Proceedings, 2768, pp. 46–53, 2020.
[11] Giuliano, R., Mazzenga, F., Vizzarri, A., Satellite-Based Capillary 5G-mMTC Networks for Environmental Applications, IEEE Aerospace and Electronic Systems Magazine, 34(10), pp. 40–48, 2019, https://doi.org/10.1109/MAES.2019.2923295.
[12] Capizzi, G., Lo Sciuto, G., Napoli, C., Tramontana, E., A multithread nested neural network architecture to model surface plasmon polaritons propagation, Micromachines, 7(7), art. no. 110, 2016.
[13] Capizzi, G., Lo Sciuto, G., Napoli, C., Tramontana, E., Woźniak, M., A novel neural networks-based texture image processing algorithm for orange defects classification, International Journal of Computer Science and Applications, 13(2), pp. 45–60, 2016.
[14] Napoli, C., Bonanno, F., Capizzi, G., Exploiting solar wind time series correlation with magnetospheric response by using an hybrid neuro-wavelet approach, Proceedings of the International Astronomical Union, 6(S274), pp. 156–158, 2010, DOI: 10.1017/S1743921311006806.
[15] Khanam, J.J., Foo, S.Y., A comparison of machine learning algorithms for diabetes prediction, ICT Express, 2021, ISSN 2405-9595.
[16] Tigga, N.P., Garg, S., Prediction of Type 2 Diabetes using Machine Learning Classification Methods, Procedia Computer Science, Vol. 167, pp. 706–716, 2020, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2020.03.336.
[17] Hasan, M.K., et al., Diabetes Prediction Using Ensembling of Different Machine Learning Classifiers, 2020, doi: 10.1109/ACCESS.2020.2989857.
[18] Montavon, G., et al., Layer-Wise Relevance Propagation: An Overview, in: Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, https://doi.org/10.1007/978-3-030-28954-6_10.
[19] Montavon, G., Bach, S., Binder, A., Samek, W., Müller, K.-R., Deep Taylor Decomposition of Neural Networks, Proceedings of the ICML'16 Workshop on Visualization for Deep Learning, 2016.
[20] Kauffmann, J., et al., Towards explaining anomalies: A deep Taylor decomposition of one-class models, Pattern Recognition, https://doi.org/10.1016/j.patcog.2020.107198.