Probabilistic Neuro-Fuzzy System in Medical Diagnostic Task and its Lazy Learning-Selflearning

Yevgeniy Bodyanskiy, Anastasiia Deineko, Iryna Pliss and Olha Chala

Control Systems Research Laboratory, Kharkiv National University of Radio Electronics, Kharkiv, Ukraine

IDDM'2020: 3rd International Conference on Informatics & Data-Driven Medicine, November 19-21, 2020, Växjö, Sweden
EMAIL: yevgeniy.bodyanskiy@nure.ua (Ye. Bodyanskiy); anastasiia.deineko@nure.ua (A. Deineko); iryna.pliss@nure.ua (I. Pliss); olha.chala@nure.ua (O. Chala)
ORCID: 0000-0001-5418-2143 (Ye. Bodyanskiy); 0000-0002-3279-3135 (A. Deineko); 0000-0001-7918-7362 (I. Pliss); 0000-0002-7603-1247 (O. Chala)

Abstract
A computational intelligence system that is a hybrid of a probabilistic neural network and a neuro-fuzzy system is proposed for solving medical diagnostic tasks. The distinctive feature of the proposed system is its ability to process data given in different scales: numerical, ordinal, nominal, and binary. The tuning process of the system is a hybrid of lazy learning and self-learning according to T. Kohonen. Moreover, the system is characterized by high processing speed compared to deep neural networks trained with error-backpropagation procedures. The diagnostic system under consideration has an uncomplicated computational implementation. It is intended to work with both short and long datasets under conditions of overlapping classes of diagnoses, which is typical for medical applications.

Keywords
Medical Data Mining, Probabilistic Neural Network, Neuro-Fuzzy System, Membership Function, Lazy Learning, Pattern Recognition

1. Introduction

Data mining methods are currently widely used in the analysis of medical information [1-3] and, first of all, in diagnosis problems based on the available data on the patient's state. As a rule, from a data mining standpoint, medical diagnostics problems are considered either as pattern classification-recognition, as clustering (recognition without a teacher), or as forecasting-prediction of the disease course. Methods of computational intelligence [4-6] adapted for solving medical problems [7-10] have proved to be the best mathematical apparatus here. Artificial neural networks are efficient due to their ability to tune their parameters, the synaptic weights (and sometimes the architecture), on a training dataset, which ultimately allows restoring separating hypersurfaces of arbitrarily complex shape between classes of diagnoses. Here deep neural networks have effectively demonstrated their capabilities [11, 12], providing recognition accuracy entirely inaccessible to other approaches. At the same time, there is a broad class of situations where deep neural networks are either ineffective or completely inoperable, notably problems with a short training dataset, which often arise in real medical cases. Also, medical information is often presented not only in numerical (interval and ratio) scales but also in nominal, ordinal (rank), or binary scales.
Probabilistic neural networks (PNNs) [13] are well suited for solving recognition-classification problems under conditions of a limited amount of training data; however, they are crisp systems operating under non-overlapping classes and learning in a batch mode. In [14-18], fuzzy and online PNN modifications were introduced that solve recognition problems under overlapping classes and are trained in sequential mode. The main disadvantages of these systems are their
cumbersomeness (the size of the training dataset determines the number of nodes in the pattern layer) and their ability to work only with numerical data. The ability to work with data in different scales is an advantage of neuro-fuzzy systems [19]; here, for the problem under consideration, ANFIS, Takagi-Sugeno-Kang, Wang-Mendel, and other systems can be noted. Unfortunately, training these systems (tuning their synaptic weights and sometimes their membership functions) may require relatively large training datasets [20].
In this regard, it seems expedient to develop a hybrid of a probabilistic neural network (PNN) and a neuro-fuzzy system for solving classification-diagnostics-recognition problems under conditions of overlapping classes and training data given in different scales, with the ability of instantaneous tuning based on lazy learning [21].

2. The architecture of the proposed system

The proposed probabilistic neuro-fuzzy system contains four layers of information processing: the first hidden layer of fuzzification, formed by one-dimensional bell-shaped membership functions; the second, aggregation hidden layer, formed by elementary multiplication blocks; the third hidden layer of adders, whose number equals the number of classes into which the original data array should be split, plus one; and, finally, the fourth, output defuzzification layer, formed by division blocks, at whose outputs signals appear that determine the levels of fuzzy membership of each observation in each of the possible classes. Unlike classical neuro-fuzzy systems, there is no layer of tunable synaptic weights here. As will be shown below, the learning process of the proposed system is implemented in the first hidden layer by adjusting the parameters of the membership functions. It is clear that this approach simplifies the numerical implementation of the system and improves its performance.
The initial information for the system synthesis is a training dataset formed by a set of n-dimensional image-vectors $x(k) = (x_1(k), x_2(k), \dots, x_i(k), \dots, x_n(k))^T$, each of which belongs to a specific class $Cl_j$, $j = 1, 2, \dots, m$ (here $1 \le k \le N$ is the observation number in the original array or the moment of the current discrete time in Data Stream Mining tasks). It is convenient to rearrange the original training dataset so that the first $N_1$ observations belong to the first class $Cl_1$, the following $N_2$ observations to $Cl_2$, and, finally, the last $N_m$ observations to class $Cl_m$. Moreover, for each class, instead of the index number k, it is convenient to introduce an intraclass numbering, so that for the first class $Cl_1$: $k = t_1 = 1, 2, \dots, N_1$; for class $Cl_2$: $k = t_2 = N_1 + 1, N_1 + 2, \dots, N_1 + N_2$; and, finally, for the last m-th class $Cl_m$: $k = t_m = N_1 + N_2 + \dots + N_{m-1} + 1, \dots, N_1 + N_2 + \dots + N_m = N$.
Based on this training dataset, the first hidden fuzzification layer is formed by Gaussian membership functions

$$\mu_{l_i}(x_i, w_{l_i}) = \exp\left(-0.5\,\sigma_{l_i}^{-2}(x_i - w_{l_i})^2\right), \quad (1)$$

where $w_{l_i}$ are the fixed or (more generally) adjustable centers of the corresponding membership functions, and $\sigma_{l_i}^2$ is a parameter specifying the width of the corresponding function, also fixed or tuned; $l_i = 1, 2, \dots, h_i$; $i = 1, 2, \dots, n$.
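To make the fuzzification step concrete, here is a minimal Python sketch of the first hidden layer (1); the function name fuzzify and the argument layout are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fuzzify(x, centers, widths):
    """First hidden layer: one-dimensional Gaussian memberships, eq. (1).

    x       -- input vector with n components x_i
    centers -- list of n arrays; centers[i] holds the h_i centers w_{l_i}
    widths  -- list of n arrays; widths[i] holds the widths sigma_{l_i}
    Returns a list of n arrays with the h_i membership degrees per input.
    """
    return [np.exp(-0.5 * ((x[i] - centers[i]) / widths[i]) ** 2)
            for i in range(len(x))]

# A binary input ("symptom present"/"absent") needs only two functions
# with centers 0 and 1, while a numerical input may carry many more.
memberships = fuzzify(
    x=[1.0, 36.6],
    centers=[np.array([0.0, 1.0]), np.array([35.0, 37.0, 39.0])],
    widths=[np.array([0.5, 0.5]), np.array([1.0, 1.0, 1.0])])
```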
Note that in a standard probabilistic neural network, the first hidden layer of patterns is formed by multidimensional Gaussians, the number of which is determined by the training dataset size N. In the proposed system, the number of membership functions at each input can be different: for example, if a binary variable of type 1 or 0, "yes" or "no", "there is a symptom" or "there is no symptom", is supplied to an input, then two functions are enough at this input ($h_i = 2$); if at the i-th input the corresponding variable can take an arbitrary number of values, then $2 \le h_i \le N$. The total number of one-dimensional functions in the system varies in the interval

$$2n \le h \le Nn, \quad (2)$$

where $h = \sum_{i=1}^{n} h_i$.
At the output of the first hidden layer, h signals appear, the values of the corresponding Gaussians:

$$o_{l_i}^{[1]} = \mu_{l_i}(x_i, w_{l_i}). \quad (3)$$

They are then fed to the second, aggregation hidden layer, which, similarly to standard neuro-fuzzy systems, is formed by ordinary multiplication blocks, the number of which equals N. In this layer, multidimensional kernel activation functions are formed from the one-dimensional membership functions:

$$\mu_{t_j}(x, w_{t_j}) = \prod_{i=1}^{n} \exp\left(-0.5\,\sigma^{-2}(x_i - w_{l_i})^2\right) = \exp\left(-0.5\,\sigma^{-2}\|x - w_{t_j}\|^2\right), \quad (4)$$

the vector centers of which, $w_{t_j} = (w_{l_1}, \dots, w_{l_i}, \dots, w_{l_n})^T$, are formed from the centers of the one-dimensional membership functions. Moreover, for each j-th class, $N_j$ multidimensional activation functions are formed. As a result, a signal is generated at the output of the second hidden layer:

$$o_{t_j}^{[2]} = \mu_{t_j}(x, w_{t_j}). \quad (5)$$

The third hidden layer is formed from summation blocks, the number of which is determined by the value m + 1. The first m adders calculate the data density distribution for each class

$$p_j(x) = o_j^{[3]} = \sum_{t_j = N_1 + N_2 + \dots + N_{j-1} + 1}^{N_1 + N_2 + \dots + N_j} o_{t_j}^{[2]}, \quad (6)$$

and the (m+1)-th one the overall data density distribution

$$p(x) = o^{[3]} = \sum_{j=1}^{m} p_j(x) = \sum_{j=1}^{m} o_j^{[3]}. \quad (7)$$

In the output defuzzification layer, the probability level that the presented observation x belongs to the j-th class is calculated:

$$\hat{y}_j(x) = \frac{o_j^{[3]}(x)}{o^{[3]}(x)} = o_j^{[3]}(x)\left(\sum_{j=1}^{m} o_j^{[3]}(x)\right)^{-1}. \quad (8)$$

It is obvious that

$$\sum_{j=1}^{m} \hat{y}_j(x) = 1. \quad (9)$$
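Under the assumptions that the kernel centers have already been placed (e.g., at the training observations, as in the lazy learning of Section 3 below) and that a single common width σ is used, the whole forward pass (4)-(8) can be sketched in a few lines of Python; all names here are illustrative:

```python
import numpy as np

def pnfs_forward(x, class_centers, sigma=0.2):
    """Layers 2-4 of the probabilistic neuro-fuzzy system, eqs. (4)-(8).

    x             -- n-dimensional observation to classify
    class_centers -- list of m arrays; class_centers[j] has shape (N_j, n)
                     and holds the kernel centers w_{t_j} of the j-th class
    sigma         -- common width of the multidimensional Gaussians
    Returns the vector of fuzzy membership levels y_hat_j(x), eq. (8).
    """
    densities = []
    for centers in class_centers:
        # Layer 2: multidimensional Gaussian kernels (4);
        # layer 3: class data density p_j(x), eq. (6).
        sq_dist = np.sum((centers - x) ** 2, axis=1)  # ||x - w_{t_j}||^2
        densities.append(np.sum(np.exp(-0.5 * sq_dist / sigma ** 2)))
    densities = np.array(densities)
    # Layer 4: defuzzification (8); the outputs sum to one, eq. (9).
    return densities / densities.sum()

# Two overlapping classes of 2-D observations ("neurons at data points").
classes = [np.array([[0.1, 0.2], [0.2, 0.1]]),   # Cl_1, N_1 = 2
           np.array([[0.8, 0.9], [0.7, 0.8]])]   # Cl_2, N_2 = 2
print(pnfs_forward(np.array([0.15, 0.15]), classes))
```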
3. Combined training of the probabilistic neuro-fuzzy system

In general, the tuning of the proposed system can be implemented on the basis of so-called lazy learning [21], in the same way as a standard PNN is configured. Lazy learning rests on the principle of "neurons at data points", when the centers of the kernel activation functions coincide with the observations from the training set. For each observation $x(t_j)$, a multidimensional bell-shaped activation function $\mu_{t_j}(x, w_{t_j})$ (where $w_{t_j} = x(t_j)$) is formed. It is clear that such a learning process is implemented rapidly. Still, if the size N of the training sample is large enough, the PNN becomes too cumbersome. Following this approach, in a neuro-fuzzy system N membership functions should be formed at each input of the fuzzification layer. However, if the training signals at some inputs are specified in nominal, binary, or rank scales, the number of membership functions at the corresponding inputs decreases significantly. In addition, in medical applications numerical variables, such as the patient's temperature, are often repeated, which reduces the number of distinct membership functions.
Finally, the most straightforward case occurs when all input signals are specified in a binary scale: "there is a symptom" or "there is no symptom". Only two membership functions with center coordinates 0 and 1 are formed at each input.
In the case when all the input variables are specified on a numerical scale, the number of one-dimensional membership functions is determined by the values $h_i = N$, $h = Nn$, so with larger training datasets the system can become too cumbersome. It is possible to overcome this problem using a self-learning procedure for the centers of the membership functions [22], while their number $h_i$ at each input remains constant.
Let us set the maximum possible number of membership functions at the i-th input, $h_i^*$, and, before starting the learning process, place them evenly along the axis $x_i$ on the interval [0, 1], so that the distance between the initial centers $w_{l_i}(0)$ and $w_{l_i+1}(0)$ is determined by the value

$$\Delta_i(0) = (h_i^* - 1)^{-1}. \quad (10)$$

When the first vector from the training dataset, $x(1) = (x_1(1), \dots, x_i(1), \dots, x_n(1))^T$, is fed to the system input (it does not matter to which of the classes $Cl_j$ it belongs), the center-"winner" $w_{l_i^*}(0)$ is determined first, which is the nearest to $x_i(1)$ in the sense of the distance

$$d_{l_i i} = |x_i(1) - w_{l_i}(0)|, \quad (11)$$

i.e.

$$w_{l_i^*}(0) = \arg\min_{l_i} \left\{d_{1i}, \dots, d_{l_i i}, \dots, d_{h_i^* i}\right\}. \quad (12)$$

After this, the center-"winner" is pulled up to the component $x_i(1)$ of the input signal according to the expression

$$w_{l_i^*}(1) = w_{l_i^*}(0) + \eta_i(1)(x_i(1) - w_{l_i^*}(0)), \quad (13)$$

where $0 < \eta_i(1) \le 1$ is the learning rate. It is clear that when $\eta_i(1) = 1$ the center-"winner" moves to the point $x_i(1)$, following the principle of "neurons at data points".
At the k-th iteration, the tuning procedure can be written in the form

$$w_{l_i}(k) = \begin{cases} w_{l_i}(k-1) + \eta_i(k)(x_i(k) - w_{l_i}(k-1)), & \text{if } w_{l_i}(k-1) \text{ is the "winner"}, \\ w_{l_i}(k-1), & \text{otherwise}, \end{cases} \quad l_i = 1, 2, \dots, h_i;\ i = 1, 2, \dots, n. \quad (14)$$

It is easy to see that the last expression implements T. Kohonen's self-learning principle "Winner Takes All" (WTA) [23]. Thus, the combination of lazy learning and self-learning can significantly simplify both the architecture and the tuning process of the probabilistic neuro-fuzzy system.
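A minimal Python sketch of the initialization (10) and of one WTA self-learning step (11)-(14) for a single input is given below; the function names and the per-input organization are illustrative assumptions, not the authors' code:

```python
import numpy as np

def init_centers(h_star):
    """Evenly spaced initial centers on [0, 1]; the spacing between
    neighbouring centers equals (h* - 1)^(-1), eq. (10)."""
    return np.linspace(0.0, 1.0, h_star)

def wta_step(centers, x_i, eta):
    """One 'Winner Takes All' self-learning step, eqs. (11)-(14).

    centers -- current centers w_{l_i}(k-1) of the i-th input
    x_i     -- i-th component of the current training vector x(k)
    eta     -- learning rate, 0 < eta <= 1; eta = 1 reproduces the
               'neurons at data points' principle
    """
    winner = np.argmin(np.abs(x_i - centers))         # eqs. (11)-(12)
    centers[winner] += eta * (x_i - centers[winner])  # eqs. (13)-(14)
    return centers

centers = init_centers(5)                    # h_i* = 5 centers per input
centers = wta_step(centers, 0.33, eta=0.5)   # only the winner moves
```

Only the winning center moves at each step; all the other centers stay in place, exactly as prescribed by (14).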
4. Results of the experiment

The proposed probabilistic neuro-fuzzy system is designed to work with different data types, such as numerical and binary data, presented in both long and short datasets. Therefore, two datasets with different data types were taken from the UCI repository for the experimental evaluation. The first dataset, "Heart Disease", contains 303 instances. Each of them includes detailed information about the patient, his or her physiological parameters, and the symptoms of a disease. This dataset is a mix of numerical and binary data: physiological parameters have a numerical form, and symptoms typically have a binary form. The second dataset, "Diabetes 130-US hospitals for years 1999-2008", is a long dataset that contains 100,000 instances. It includes features that represent the outcomes of treatment for patients: the length of stay at the hospital, information about the laboratory tests, and the medications administered while the patients were at the hospital. This dataset also contains numerical and binary data.
The first experiment was carried out with the short set of medical data. This small set was further subdivided into subsets in order to determine the minimum number of data items needed to obtain practical classification results. The classification accuracies of the machine learning method KNN (k-nearest neighbours), EFPNN (evolving fuzzy-probabilistic neural network) [24], and the proposed system (PNFSL) were compared. The experimental evaluation results are represented in Table 1.

Table 1
The algorithms' accuracy comparison for small datasets

             Classification accuracy (%) by subset size
Algorithm    100        150        200        250        Max time, sec
KNN          50.24      51.63      50.70      49.03      0.03
EFPNN        56.07      61.90      71.83      79.02      0.10
PNFSL        51.14      57.70      69.34      77.52      0.79

The experiment results show that the KNN algorithm is fast, but its accuracy is close to 50%; thus, the algorithm is not suitable for the classification of very short samples. Unlike KNN, the proposed network's classification accuracy increases as the number of elements in the sample increases: even on very small samples, it achieves an accuracy of about 77%. EFPNN also gains accuracy as the sample size increases; however, it is significantly slower, by more than 20%, than the proposed network. The fastest method is KNN, but it should be taken into account that the neural networks are implemented in Python and run on the CPU, and not on the GPU as KNN does. This means that with the same hardware implementation the time costs of all methods will be comparable, while the accuracy of the proposed network is higher.

Figure 1: The dependence of classification time growth on dataset size

The second experiment was performed on the long dataset "Diabetes 130-US hospitals for years 1999-2008". From the initial dataset, a number of subsets of different sizes, from 3,000 to 30,000 instances, were formed. The experiment is intended to compare the growth of the classification time as the dataset size grows, because the absolute time consumption depends on the computing platform and the processor used (CPU, GPU). Based on the results of the first experiment, the two neural networks that provide higher classification accuracy, PNFSL and EFPNN, were selected for the second one. The experiment showed that the proposed approach requires less computational cost than EFPNN. The time required grows significantly faster for larger subsets than for small ones. This trend apparently exposes the influence of the software environment on the classification time: smaller datasets are usually allocated in RAM, while long datasets require swapping of data from external memory.
In general, according to the results of the two experiments, the proposed approach, in comparison with EFPNN, provides slightly lower classification accuracy for small datasets but requires significantly lower computational costs as the dataset size grows.

5. Conclusion

The probabilistic neuro-fuzzy system is proposed for solving problems of classification-medical diagnostics under conditions when the information about the patient's state is given simultaneously in numerical, rank (ordinal), nominal, and binary scales. A feature of the system under consideration is the ability to work with both short and growing long training sets, when new observations are sequentially fed into the system in online mode. The system tuning process is based on both lazy learning and self-learning, which significantly simplifies the system's computational implementation. The proposed method is characterized by high speed (just-in-time learning) and simplicity of numerical implementation, as confirmed by the experimental results.

6. References

[1] E. Giannopoulou, Data Mining in Medical and Biological Research. N.Y.: ITAC, 2008.
doi: 10.5772/95.
[2] P. Berka, S. Rauch, D. Zighed, Data Mining and Medical Knowledge Management: Cases and Applications. Hershey: IGI Global, 2009. doi: 10.4018/978-1-60566-218-3.
[3] A. Karahoca, Data Mining Applications in Engineering and Medicine. InTechOpen, 2012. doi: 10.5772/2616.
[4] C. Mumford, L. Jain, Computational Intelligence: Collaboration, Fusion and Emergence. Berlin: Springer-Verlag, 2009. doi: 10.1007/978-3-642-01799-5.
[5] R. Kruse, C. Borgelt, F. Klawonn, C. Moewes, M. Steinbrecher, P. Held, Computational Intelligence. A Methodological Introduction. Berlin: Springer-Verlag, 2013. doi: 10.1007/978-1-4471-5013-8.
[6] J. Kacprzyk, W. Pedrycz, Springer Handbook of Computational Intelligence. Berlin Heidelberg: Springer-Verlag, 2015. doi: 10.1007/978-3-662-43505-2.
[7] R. Kantchev, Advances in Intelligent Analysis of Medical Data and Decision Support Systems. Springer, 2013. doi: 10.1007/978-3-319-00029-9.
[8] M. Schmitt, H.-N. Teodorescu, A. Jain, A. Jain, S. Jain, Computational Intelligence Processing in Medical Diagnosis. Berlin Heidelberg: Springer-Verlag, 2002. doi: 10.1007/978-3-7908-1788-1.
[9] Yu. Syerov, N. Shakhovska, S. Fedushko, Method of the Data Adequacy Determination of Personal Medical Profiles, in: Advances in Artificial Systems for Medicine and Education II. Springer Nature Switzerland AG, 2018, pp. 333-343. doi: 10.1007/978-3-030-12082-5_31.
[10] Ye. Bodyanskiy, I. Perova, O. Vynokurova, I. Izonin, Adaptive wavelet diagnostic neuro-fuzzy network for biomedical tasks, in: 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering (TCSET), 2018, pp. 711-715. doi: 10.1109/TCSET.2018.8336299.
[11] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning. MIT Press, 2016.
[12] D. Graupe, Deep Learning Neural Networks: Design and Case Studies. N.Y.: World Scientific, 2016.
[13] D. F. Specht, Probabilistic neural networks, Neural Networks, volume 3, pp. 109-118 (1990). doi: 10.1016/0893-6080(90)90049-Q.
[14] Ye. Bodyanskiy, Ye. Gorshkov, V. Kolodyazhniy, J. Wernstedt, A learning of probabilistic neural network with fuzzy inference, in: Proc. 6th Int. Conf. on Artificial Neural Nets and Genetic Algorithms "ICANNGA 2003", Wien: Springer-Verlag, 2003, pp. 13-17. doi: 10.1007/978-3-7091-0646-4_3.
[15] Ye. Bodyanskiy, Ye. Gorshkov, V. Kolodyazhniy, J. Wernstedt, Probabilistic neuro-fuzzy network with non-conventional activation functions, volume 2773 of Lecture Notes in Artificial Intelligence. Berlin Heidelberg New York: Springer, 2003. doi: 10.1007/978-3-540-45226-3_133.
[16] L. Rutkowski, Adaptive probabilistic neural networks for pattern classification in time-varying environment, IEEE Trans. on Neural Networks, 2004, pp. 811-827. doi: 10.1109/TNN.2004.828757.
[17] J.-H. Yi, J. Wang, G.-G. Wang, Improved probabilistic neural networks with self-adaptive strategies for transformer fault diagnosis problem, Advances in Mechanical Engineering, volume 8, pp. 1-13 (2016).
[18] P. Zhernova, I. Pliss, O. Chala, Modified fuzzy probabilistic neural network, in: Intellectual Systems for Decision Making and Problems of Computational Intelligence ISDMCI'2018, Kherson: PP Vyshemirsky V. S., 2018, pp. 228-230.
[19] P.V.C. Souza, Fuzzy neural networks and neuro-fuzzy networks: A review of the main techniques and applications used in the literature, Applied Soft Computing, volume 92 (2020).
[20] S. Osowski, Sieci neuronowe do przetwarzania informacji [Neural Networks for Information Processing], Warszawa: Oficyna Wydawnicza Politechniki Warszawskiej, 2006.
[21] O.
Nelles, Nonlinear System Identification. Berlin: Springer, 2001. doi: 10.1007/978-3-662-04323-3.
[22] Ye. Bodyanskiy, O. Tyshchenko, A. Deineko, Evolving Neuro-fuzzy Systems with Kernel Activation Functions: Their Adaptive Learning for Data Mining Tasks, volume 58. Saarbrücken: LAP LAMBERT Academic Publishing, 2015.
[23] T. Kohonen, Self-Organizing Maps. Berlin: Springer-Verlag, 1995. doi: 10.1007/BF02844683.
[24] Ye. Bodyanskiy, A. Deineko, I. Pliss, O. Chala, Evolving fuzzy-probabilistic neural network and its online learning, in: 2020 10th International Conference on Advanced Computer Information Technologies (ACIT 2020) Proceedings, 2020, pp. 465-468.