Mathematical and Information Technologies, MIT-2016 — Information Technologies

Construction Features and Data Analysis by the BP-SOM Modular Neural Network

Vladimir Gridin and Vladimir Solodovnikov

Design Information Technologies Center, Russian Academy of Sciences, Odintsovo, Moscow region, Russia
info@ditc.ras.ru
http://ditc.ras.ru

Abstract. Data are a valuable resource with great potential for the recovery of useful analytical information. One of the most promising toolkits for data mining problems is neural network technology. The choice of initial parameter values and the ways of constructing a neural network are considered using the multilayer perceptron as an example, taking into account information about the task and the available raw data. The modular BP-SOM network, which combines a multilayer feed-forward network trained with the back-propagation (BP) algorithm and Kohonen's self-organizing maps (SOM), is suggested for visualizing the internal representation of information and assessing the resulting architecture. The features of BP-SOM operation, methods of rule extraction from trained neural networks, and ways of interpreting the results are presented.

Keywords: neural network, multilayer feed-forward network, Kohonen self-organizing maps, modular network BP-SOM, rule extraction.

1 Introduction

Data analysis processes are often related to tasks characterized by a lack of information about the sample structure and about the dependencies and distributions of the analyzed indicators. One of the approaches that best fits this situation is the use of neural network technology. The ability of neural networks to learn, to model nonlinear processes, to deal with noisy data, and to extract and generalize the essential features of the incoming information makes them one of the most promising toolkits for data mining problems. However, there are several difficulties in using this approach. In particular, there is the problem of choosing an optimal network topology, parameter values and structural features that would best match the problem being solved and the available raw data. This is due to the fact that different neural networks can show very similar results on samples from the training set and significantly different results on new, previously unseen data. Designing the optimal topology of a neural network can be represented as a search for the architecture that provides the best (relative to the chosen criterion) solution of a particular problem. Usually a particular architecture and its structural features are selected on the basis of an assessment that relies on knowledge of the problem and of the available source data. After that, training and testing take place, and their results are used to decide whether the network meets all the requirements. Another complication of the neural network approach is the interpretation of results and of their preconditions. This problem appears especially clearly for the multilayer perceptron (MLP) [1]. In effect, a neural network acts as a "black box": the source data are fed to the input neurons and the result is obtained from the output, but no explanation of the reasons for such a solution is provided.
The rules are contained in the weight coefficients, activation functions and connections between neurons, but their structure is usually too complex to understand. Moreover, in a multilayer network these parameters may represent non-linear, non-monotonic relationships between input and target values. In general, it is therefore not possible to isolate the influence of a particular characteristic on the target value, because the effect is mediated by the values of other parameters. In addition, the back-propagation (BP) learning algorithm has difficulties of its own, both with local extrema of the error function and with the solution of certain classes of problems.

2 Choosing the architecture and initial values of the neural network

The choice of a neural network architecture can be based on knowledge of the problem being solved and of the available source data, their dimension and the size of the samples. There are different approaches to choosing the initial values of the neural network characteristics. For example, the "Network Advisor" of the ST Neural Networks package by default proposes, for the multilayer perceptron, one intermediate layer with a number of elements equal to half the sum of the numbers of inputs and outputs. In general, the choice of the number of hidden elements for the multilayer perceptron must balance two opposing requirements: on the one hand, the number of elements should be adequate for the task; on the other hand, it should not be too large, so that the necessary generalization capability is preserved and overfitting is avoided. In addition, the number of hidden units depends on the complexity of the function that the neural network should reproduce, but this function is not known in advance. It should also be noted that as the number of elements increases, the required number of observations increases as well. As an estimate, one can use the principle of joint optimization of the empirical error and the complexity of the model, which takes the following form [1]:

min{description of the error + description of the model}.   (1)

The first term of this expression is responsible for the accuracy of the approximation and for the observed learning error: the smaller it is, the fewer bits are needed to correct the model predictions. If the model predicts the data exactly, the length of the error description is zero. The second term represents the amount of information needed to select a specific model from the set of all possible ones. Taking it into account imposes the necessary constraints on the complexity of the model by suppressing an excessive number of tuning parameters.

The accuracy of the neural network function approximation increases with the number of neurons in the hidden layer: with h hidden neurons the error can be estimated as O(1/h). Since the number of outputs of the network does not exceed, and is typically much smaller than, the number of inputs, the bulk of the weights in a two-layer network is concentrated in the first layer, i.e. w = i · h, where i is the input dimension. In this case the average approximation error, expressed through the total number of weights in the network, is O(i/w).
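As a small illustration (not part of the original paper; the function names and the default rule are assumptions used only for the sketch), the default sizing heuristic and the asymptotic error estimates above can be written as:

import math

def default_hidden_units(n_inputs: int, n_outputs: int) -> int:
    """ST Neural Networks-style default: half the sum of inputs and outputs."""
    return max(1, (n_inputs + n_outputs) // 2)

def approx_error_order(n_inputs: int, n_hidden: int) -> float:
    """Order-of-magnitude estimate of the approximation error.

    With h hidden neurons the error behaves as O(1/h); since the first layer
    holds w = i * h weights, the same estimate reads O(i / w).
    """
    w = n_inputs * n_hidden          # weights concentrated in the first layer
    return n_inputs / w              # equals 1 / n_hidden

if __name__ == "__main__":
    i, o = 12, 2
    h = default_hidden_units(i, o)
    print(f"default hidden units: {h}, error order ~ {approx_error_order(i, h):.3f}")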
The description of the network is associated with the model complexity and essentially comes down to the amount of information required to transmit its weight values through some communication channel. If we accept a hypothesis ψ about the network configuration, its weights and the number of neurons, the amount of information (in the absence of noise) needed to transfer the weights is −log(Prob), where Prob is the probability of this event before the message arrives at the receiver input. For a given accuracy this description requires about −log P(ψ) ∼ w bits. Therefore, the specific error per pattern associated with the complexity of the model can be estimated as ∼ w/p, where p is the number of patterns in the training set. This error decreases monotonically as the number of patterns grows. Haykin, using the results of Baum and Haussler, gives recommendations on the size of the training sample relative to the number of weight coefficients, taking into account the proportion of errors allowed during testing, which can be expressed by the inequality p ≥ w/ε, where ε is the proportion of errors allowed during testing. Thus, when 10% of errors are acceptable, the number of training patterns must be 10 times greater than the number of weight coefficients in the network.

Thus, both components of the network generalization error from expression (1) have been considered. It is important that these components depend differently on the network size (the number of weights), which implies the possibility of choosing an optimal size that minimizes the total error:

Expression (1) ∼ i/w + w/p ≥ 2 · √(i/p).   (2)

The minimum error (the equality sign) is achieved at the optimal number of weights in the network, w ∼ √(p · i), which corresponds to the following number of neurons in the hidden layer:

h = w/i ∼ √(p/i),   (3)

where i and h are the numbers of neurons in the input and hidden layers, w is the number of weights and p is the number of patterns in the training sample.

3 Evaluation of the resulting architecture and parameters

After the network architecture has been selected, the learning process is carried out; for a multilayer perceptron this may be the error back-propagation (BP) algorithm. One of the most serious problems is that the network is trained to minimize the error on the training set rather than the error that can be expected when the network processes completely new patterns. Thus, in the absence of an ideal and infinitely large training sample, the training error will differ from the generalization error of the previously unknown model of the phenomenon [2]. Since the generalization error is defined for data not included in the training set, a solution is to split all the available data into two sets: a training set, which is used to fit specific values of the weights, and a validation set, which evaluates the predictive ability of the network and is used to select the model of optimal complexity. The training process is commonly stopped by inspecting "learning curves", which track the dependence of the learning and generalization errors on the neural network size [3, 4]. The optimum corresponds to local minima and to the points where the graphs approach their asymptotes.
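A minimal sketch of such a stopping rule based on the validation set (an illustration only; train_epoch and validation_error are hypothetical callables supplied by the surrounding training code, not part of the paper):

def train_with_early_stopping(train_epoch, validation_error, max_epochs=1000, patience=20):
    """Stop when the validation error has not improved for `patience` epochs.

    train_epoch():       runs one training epoch and returns the training error
    validation_error():  returns the current error on the held-out validation set
    """
    best_val, best_epoch, history = float("inf"), 0, []
    for epoch in range(max_epochs):
        tr_err = train_epoch()
        val_err = validation_error()
        history.append((epoch, tr_err, val_err))
        if val_err < best_val:
            best_val, best_epoch = val_err, epoch   # candidate stopping point (t* in Fig. 1 below)
        elif epoch - best_epoch >= patience:
            break                                   # validation error has stopped improving
    return best_epoch, history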
Figure 1 shows the stop point, which corresponds to the minimum of the validation error (dash-dotted line), while the training error (solid line) keeps going down.

Fig. 1. The stopping criterion for training at time t*.

Another class of learning curves uses the dependence of internal properties of the neural network on its size, which is then mapped onto the generalization error. For example, in [3] the internal representation of the problem being solved is analyzed together with the relationship between the training error and the maximum sum of absolute synapse weights per neuron of the network. There are also variants of generalized curves based on the dependence of a wave criterion on the neural network size [5], or based on a comparison of the average absolute values of the synapse weights [6]. In a simplified form, the following criteria can be formulated to assess an already constructed neural network model:

– if the training error is small and the testing error is large, the network contains too many synapses;
– if both the training error and the testing error are large, the number of synapses is too small;
– if all the synapse weights are too large, there are too few synapses.

After evaluation of the neural network model, a decision is made about whether the number of hidden elements should be changed in one direction or the other, and the learning process is then repeated. It is worth mentioning that modified decision trees, based on first-order predicate logic, can be applied as a means of decision-making support and of automating the construction of the neural network structure.

4 Rule extraction and result interpretation

Generally speaking, there are two approaches to extracting rules from multilayer neural networks. The first approach is based on the extraction of global rules that characterize the output classes directly through the input parameter values. The alternative is the extraction of local rules, separating the multilayer network into a set of single-layer networks. Each extracted local rule characterizes a separate hidden or output neuron on the basis of its weighted connections with other elements. The rules are then combined into a set that determines the behavior of the whole network. The NeuroRule algorithm is applicable to rule extraction from trained multilayer neural networks such as the perceptron. This algorithm prunes the network and identifies the most important features. However, it imposes quite strict limitations on the architecture, the number of elements and connections, and the type of activation functions. As an alternative, TREPAN-type algorithms can be highlighted, which extract structured knowledge in the form of a decision tree not only from extremely simplified neural networks but also from arbitrary classifiers [7]. However, this approach does not take into account structural features that could introduce additional information. A solution to this kind of problem can be based on the modular neural network BP-SOM [8].

5 The modular neural network BP-SOM

5.1 Network architecture

The main idea is to increase the similarity of the reactions of the hidden elements when processing patterns from the sample that belong to the same class.
The traditional feed-forward architecture [1], in particular the multilayer perceptron with the back-propagation (BP) learning algorithm, is combined with Kohonen self-organizing maps (SOM) [9]: each hidden layer of the perceptron is associated with a certain self-organizing map. The structure of such a network is shown in Figure 2.

Fig. 2. The modular neural network BP-SOM with one hidden layer.

5.2 Learning algorithm

The BP-SOM learning algorithm largely corresponds to a combination of the learning rules of its component parts [8]. First, an input vector from the training sample is fed to the input of the network and a forward pass is carried out. At the same time, the activations of the neurons of each hidden layer are used as the vector of input values for the corresponding SOM. Training of the SOM components is carried out in the usual way and ensures their self-organization. This self-organization is then used to account for the classes, whose tags are assigned to the elements of the Kohonen map. For this purpose a count is kept of the number of times each SOM neuron became the winner, together with the class of the corresponding training vector. The winner is the SOM neuron whose weight vector is closest, in terms of the Euclidean distance, to the vector of output values of the hidden-layer neurons. The most frequent class is taken as the label. The reliability is calculated as the ratio of the number of occurrences of the class label to the total number of wins of the neuron; for example, if a SOM neuron won 4 times for class A and 2 times for class B, the label of class A is selected with reliability 4/6. The total accuracy of the self-organizing map is equal to the average reliability of all elements of the map. The SOM also allows the data to be visualized and displays areas corresponding to the various classes (Fig. 2).

The multilayer-perceptron component is trained by an algorithm similar to back-propagation (BP), minimizing the aggregate squared error

BP_Error = ∑_i (d_i − y_i)²,   (4)

where the index i runs over all outputs of the multilayer network, d_i is the desired output of neuron i, and y_i is the current output of neuron i in the last layer. This error is propagated through the network in the opposite direction, from the output to the hidden layers. In addition, an extra error component SOM_Error is introduced for the neurons of the hidden layers; it is based on information about the class of the input vector and takes the self-organizing map data into account. In the SOM that corresponds to the current hidden layer, a search is made for a special element V_SOM. This element should be the closest, in terms of the Euclidean distance, to the output vector of the hidden layer V_hidden and should carry the same class label as the input vector. The distance between the found vector V_SOM and the vector V_hidden is taken as the error value SOM_Error and is accounted for all neurons of the hidden layer. If no such element V_SOM is found, the error value SOM_Error is taken to be 0.
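A minimal sketch of this SOM-side bookkeeping (illustrative only; the array shapes and function names are assumptions, not the authors' code):

import numpy as np

def som_winner(v_hidden, som_weights):
    """Index of the SOM element whose weight vector is closest to v_hidden."""
    dists = np.linalg.norm(som_weights - v_hidden, axis=1)
    return int(np.argmin(dists))

def label_som(win_counts):
    """Assign to each SOM element its most frequent class and a reliability score.

    win_counts: array (n_elements, n_classes) with the number of times each
    element won for each class during training.
    """
    labels = win_counts.argmax(axis=1)
    totals = win_counts.sum(axis=1)
    reliability = np.where(totals > 0, win_counts.max(axis=1) / np.maximum(totals, 1), 0.0)
    return labels, reliability

def som_error(v_hidden, som_weights, labels, target_class):
    """SOM_Error: distance from v_hidden to the nearest SOM element carrying the
    class label of the current input vector; 0 if no such element exists."""
    mask = labels == target_class
    if not mask.any():
        return 0.0
    dists = np.linalg.norm(som_weights[mask] - v_hidden, axis=1)
    return float(dists.min())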
Thus, the total error for the neurons of the hidden layer takes the form

BPSOM_Error = (1 − α) · BP_Error + r · α · SOM_Error,   (5)

where BP_Error is the perceptron error (from the back-propagation algorithm); SOM_Error is the error of the winner neuron of the Kohonen network; r is the reliability factor of the winner neuron of the Kohonen network; and α is the influence coefficient of the Kohonen network errors (if its value equals 0, the rule reduces to the original BP). The results of the self-organization of the Kohonen maps are thus used to change the weight coefficients during network training. This produces an effect in which the activations of the hidden-layer neurons become more and more similar across all vectors of the same class [10]. A SOM of dimension 7 × 7 is shown in Figure 3.

Fig. 3. Coloring of the hidden-layer Kohonen self-organizing maps, depending on the learning algorithm.

The figure characterizes the reaction of the hidden-layer neurons of a BP-SOM network trained to solve a two-class classification problem. The map on the left corresponds to the basic back-propagation (BP) algorithm, the one on the right to training with the influence of the Kohonen map, i.e. BP-SOM training. Here white cells correspond to class A and black cells to class B. In turn, the size of the shaded region of a cell indicates the reliability of the result, so a completely white or completely black cell is characterized by a reliability of 100%.

This ensures structuring and visualization of the information extracted from the data, improves the perception of the phenomenon under study and helps in the process of selecting the network architecture. For example, if it is impossible to isolate areas of the SOM for the individual classes, there are not enough neurons and their number should be increased. Moreover, this approach can simplify the extraction of rules from an already trained neural network and present the result as a hierarchical structure of consistent "if-then" rules.

5.3 Rule extraction

As an example, let us consider a small test BP-SOM network that is trained to solve the classification problem defined by the following logical function [11]:

F(x0, x1, x2) = (x0 ∧ ¬x1 ∧ ¬x2) ∨ (¬x0 ∧ x1 ∧ ¬x2) ∨ (¬x0 ∧ ¬x1 ∧ x2).

This function is True (Class 1) only when exactly one of the arguments is True; otherwise the function value is False (Class 0). A two-layer neural network can be used for the implementation, consisting of three input elements, three neurons in the hidden layer and two neurons in the output layer. The dimension of the Kohonen map for the intermediate layer is 3 × 3 (Figure 4).

Fig. 4. Kohonen map, class labels and reliability.

After training, four elements of the Kohonen map acquired class labels with reliability 1, while 5 elements were left without a label and their reliability is equal to 0 (Figure 4). One possible method of rule extraction from such a neural network is an algorithm designed for classification problems with binary inputs. It consists of the following two steps [10, 11] (a sketch of both steps is given after this list):

1. Search the training set for groups of patterns that can be combined into individual subsets, each of which is associated with one element of the Kohonen map (Table 1).
2. Examine each of the subsets to identify the inputs that have a constant value within the subset.
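A minimal sketch of these two steps (illustrative only; the helper names and data layout are assumptions):

import numpy as np

def group_by_som_element(patterns, classes, winner_fn):
    """Step 1: split the training patterns into subsets keyed by the winning SOM element.

    patterns:  array (n_patterns, n_inputs)
    classes:   array (n_patterns,)
    winner_fn: maps a pattern to the index of its winning SOM element
               (e.g. via the hidden-layer activations and som_winner above)
    """
    groups = {}
    for x, c in zip(patterns, classes):
        k = winner_fn(x)
        groups.setdefault(k, {"patterns": [], "classes": []})
        groups[k]["patterns"].append(x)
        groups[k]["classes"].append(c)
    return groups

def constant_inputs(subset):
    """Step 2: return {input index: value} for inputs constant over the subset."""
    arr = np.asarray(subset)
    return {j: arr[0, j] for j in range(arr.shape[1]) if np.all(arr[:, j] == arr[0, j])}

For the subgroup attached to element k1 in the example below, constant_inputs would return {0: 0, 1: 0, 2: 0}, matching the first rule derived from Table 1.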
For example, in the subgroup associated with the element k1 all the attributes x0, x1 and x2 have the constant value 0, and for the element k3 the constant value 1.

Table 1. Splitting the training set into groups attached to specific elements of the SOM.

SOM element | Class | x0 x1 x2
k1          |   0   |  0  0  0
k3          |   0   |  1  1  1
k7          |   1   |  0  0  1
            |       |  0  1  0
            |       |  1  0  0
k9          |   0   |  0  1  1
            |       |  1  0  1
            |       |  1  1  0

Thus, it is possible to obtain the following two rules from Table 1:

– IF (x0 = 0 ∧ x1 = 0 ∧ x2 = 0) THEN (Class = 0);
– IF (x0 = 1 ∧ x1 = 1 ∧ x2 = 1) THEN (Class = 0).

However, it is rather problematic to derive rules for the elements k7 and k9 in this way. Of course, it is possible to compose a disjunction of all the possible options for each SOM element, but such rules would be hard to comprehend. Additional available information is the sum of the attributes of each pattern in the training sample. It is easy to notice that each SOM element is responsible for a certain value of this sum (Table 2).

Table 2. The sum of the attribute values of the patterns from the training sample associated with each SOM element.

        k1  k3  k7  k9
Sum      0   3   1   2
Class    0   0   1   0

These considerations can be generalized. To extract rules that correspond to the elements of the Kohonen map, we need to find constraints on the attribute values of the input vectors and on their weight coefficients. This can be done by propagating the minimum and maximum values of the neuron activation back to the previous layer, i.e. by applying the function f⁻¹ (the inverse of the activation function) to the output value of the neuron [12]:

f⁻¹(V_i^Cur) = f⁻¹(f(∑_j w_ji · V_j^Prev + bias_i)) = ∑_j w_ji · V_j^Prev + bias_i,   (6)

where V_i^Cur is the output of the i-th neuron of the current layer and V_j^Prev is the output of the j-th neuron of the previous layer. Assuming that the sigmoid is used as the neuron activation function, we obtain

f⁻¹(V_i^Cur) = −ln(1 / V_i^Cur − 1).   (7)

In addition, it is known that the elements of the self-organizing map connected to the elements of the first hidden layer of the perceptron respond to the proximity of their weight vectors to the outputs of the hidden-layer neurons. Therefore, when constructing rules for each SOM element, it is proposed to replace the back-propagated activations of the hidden layer with the values of the weight vector of the corresponding element of the self-organizing map. For example, if the hidden layer contains a neuron A, and the weight between this neuron and SOM element k is denoted by w_A^k, then the restriction takes the form

f⁻¹(w_A^k) − bias_A ≈ ∑_j w_jA · x_j   or   1 ≈ (∑_j w_jA · x_j) / (f⁻¹(w_A^k) − bias_A).   (8)

Such restrictions, obtained for all the neurons of the hidden layer, can be used to construct rules of the form

IF (∧_A (f⁻¹(w_A^k) − bias_A ≈ ∑_j w_jA · x_j)) THEN (Class = Class(k)),   (9)

where the conjunction runs over the hidden-layer neurons A. Similar rules are constructed for all SOM elements whose reliability exceeds a certain threshold. Applying this method to the initial example yields four sets of restrictions, each containing three restrictions, which corresponds to the number of neurons in the hidden layer of the perceptron. For SOM element k1:

– 1 ≈ 687 · x0 + 687 · x1 + 687 · x2;
– 1 ≈ 738 · x0 + 738 · x1 + 738 · x2;
– 1 ≈ 1062 · x0 + 1062 · x1 + 1062 · x2.
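Before interpreting these restrictions, the mechanics of equations (7)-(8) can be sketched as follows (an illustration under the assumption of sigmoid activations; the parameter names are hypothetical):

import numpy as np

def inverse_sigmoid(v):
    """f⁻¹ for the logistic sigmoid, as in equation (7); requires 0 < v < 1."""
    return -np.log(1.0 / v - 1.0)

def som_element_restrictions(w_hidden, bias_hidden, w_som_k):
    """Build one linear restriction per hidden neuron for a given SOM element k,
    following equation (8): 1 ≈ (∑_j w_jA · x_j) / (f⁻¹(w_A^k) − bias_A).

    w_hidden:    array (n_hidden, n_inputs), input-to-hidden weights w_jA
    bias_hidden: array (n_hidden,), hidden biases bias_A
    w_som_k:     array (n_hidden,), weight vector of SOM element k
                 (one component w_A^k per hidden neuron A)

    Returns an array (n_hidden, n_inputs) of normalized coefficients c_Aj such
    that each restriction reads 1 ≈ ∑_j c_Aj · x_j.
    """
    rhs = inverse_sigmoid(w_som_k) - bias_hidden        # f⁻¹(w_A^k) − bias_A
    return w_hidden / rhs[:, None]                      # divide each row by its right-hand side

Rows of coefficients that are (approximately) equal across inputs, as for k1 above, suggest rules of the simple form 1 ≈ c · (x0 + x1 + x2).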
Returning to the restrictions for k1: taking into account that x0, x1, x2 ∈ {0, 1}, the best match is achieved when x0 = 0, x1 = 0 and x2 = 0. For element k3 all the restrictions coincide and have the form

– 1 ≈ 0.33 · x0 + 0.33 · x1 + 0.33 · x2,

so the values are x0 = 1, x1 = 1, x2 = 1. For element k7 the restrictions coincide and take the form

– 1 ≈ x0 + x1 + x2,

which corresponds to the case when exactly one of the attributes is equal to 1. The restrictions for element k9 are

– 1 ≈ 0.5 · x0 + 0.5 · x1 + 0.5 · x2,

and this condition characterizes the case when two of the three attributes are equal to 1. Thus, generalizing all the restrictions, the following set of rules is obtained:

– IF (x0 + x1 + x2 ≈ 0) THEN (Class = 0);
– IF (x0 + x1 + x2 ≈ 3) THEN (Class = 0);
– IF (x0 + x1 + x2 ≈ 1) THEN (Class = 1);
– IF (x0 + x1 + x2 ≈ 2) THEN (Class = 0).

It is easy to see that this set correctly describes all the elements of the self-organizing network.

6 Conclusion

Approaches to the selection of the initial values of neural network parameters were considered, and the training process, its stopping criteria and the evaluation of the resulting architecture were analyzed. Combining different neural network architectures, such as the multilayer perceptron with the back-propagation learning algorithm and Kohonen's self-organizing maps, can bring additional possibilities both to the learning process and to the extraction of rules from a trained neural network. The self-organizing maps are used both for information visualization and for influencing the weight changes during network training. This produces an effect in which the activations of the hidden-layer neurons become increasingly similar across all vectors of the same class. It ensures the structuring of the extracted information, with the main purpose of improving the perception of the studied phenomena, assisting in the process of selecting the network architecture and simplifying rule extraction. The results can be used for data processing and for the identification of hidden patterns in information storage, which can become the basis for prognostic and design solutions.

Acknowledgements. This work was carried out with the financial support of the RFBR (grant 15-07-01117-a).

References

1. Ezhov A. A., Shumskiy S. A. Neyrokomp'yuting i ego primeneniya v ekonomike i biznese. Moscow, MEPhI Publ., 1998. (in Russian)
2. Bishop C. M. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
3. Watanabe E., Shimizu H. Relationships between internal representation and generalization ability in multi layered neural network for binary pattern classification problem. Proc. IJCNN 1993, Nagoya, Japan, 1993. Vol. 2, pp. 1736-1739.
4. Cortes C., Jackel L. D., Solla S. A., Vapnik V., Denker J. S. Learning curves: asymptotic values and rate of convergence. Advances in Neural Information Processing Systems 7 (1994). MIT Press, 1995. pp. 327-334.
5. Lar'ko A. A. Optimizaciya razmera nejroseti obratnogo rasprostraneniya. [Electronic resource]. http://www.sciteclibrary.ru/rus/catalog/pages/8621.html.
6. Caregorodcev V. G. Opredelenie optimal'nogo razmera nejroseti obratnogo rasprostraneniya cherez sopostavlenie srednih znachenij modulej vesov sinapsov. Materialy 14 mezhdunarodnoj konferencii po nejrokibernetike, Rostov-na-Donu, 2005. T. 2, S. 60-64. (in Russian)
7. Gridin V. N., Solodovnikov V. I., Evdokimov I. A., Filippkov S. V. Postroenie derev'ev reshenij i izvlechenie pravil iz obuchennyh nejronnyh setej. Iskusstvennyj intellekt i prinyatie reshenij, 2013, No. 4, pp. 26-33. (in Russian)
8. Weijters A. The BP-SOM architecture and learning rule. Neural Processing Letters, 2, 13-16, 1995.
9. Kohonen T. Self-Organization and Associative Memory. Berlin: Springer Verlag, 1989.
10. Weijters T., van den Bosch A., van den Herik J. Interpretable neural networks with BP-SOM. Machine Learning: ECML-98, Lecture Notes in Computer Science, Vol. 1398, 1998, pp. 406-411.
11. Eggermont J. Rule-extraction and learning in the BP-SOM architecture. Thesis, 1998.
12. Thrun S. B. Extracting provably correct rules from artificial neural networks. Technical Report IAI-TR-93-5, University of Bonn, Department of Computer Science, 1993.