Mathematical and Information Technologies, MIT-2016 — Information Technologies

Construction Features and Data Analysis by the BP-SOM Modular Neural Network

Vladimir Gridin and Vladimir Solodovnikov

Design Information Technologies Center, Russian Academy of Sciences, Odintsovo, Moscow region, Russia
info@ditc.ras.ru
http://ditc.ras.ru

Abstract. Data are a valuable resource with great potential for the recovery of useful analytical information. One of the most promising toolkits for data mining problems is neural network technology. The choice of initial parameter values and the ways of constructing a neural network are considered using the multilayer perceptron as an example, taking into account information about the task and the available raw data. The modular BP-SOM network, which combines a multilayer feed-forward network trained with the back-propagation (BP) algorithm and Kohonen's self-organizing maps (SOM), is suggested for visualizing the internal representation of information and assessing the resulting architecture. The features of BP-SOM operation, methods of rule extraction from trained neural networks, and ways of interpreting the results are presented.

Keywords: neural network, multilayer feed-forward network, Kohonen self-organizing maps, modular network BP-SOM, rule extraction.

1 Introduction

Data analysis processes are often related to tasks characterized by a lack of information about the sample structure and about the dependencies and distributions of the analyzed indicators. One of the approaches that best fits this situation is the use of neural network technology. The ability of neural networks to learn, to model nonlinear processes, to deal with noisy data, and to extract and generalize the essential features of the incoming information makes them one of the most promising toolkits for data mining problems. However, there are several difficulties in using this approach. In particular, there is the problem of choosing an optimal network topology, parameter values and structural features that would best match the problem being solved and the available raw data. This is due to the fact that different neural networks can show very similar results on samples from the training set and significantly different results on new, previously unseen data. Designing the optimal topology of a neural network can be represented as a search for the architecture that provides the best (relative to the chosen criterion) solution of a particular problem. Usually a particular architecture and its structural features are selected on the basis of an assessment that relies on knowledge of the problem and of the available source data. After that, training and testing take place, and their results are used to decide whether the network meets all the requirements. Another complication of the neural network approach is the interpretation of results and of their preconditions. This problem appears especially clearly for the multilayer perceptron (MLP) [1]. In effect, a neural network acts as a "black box": the source data are fed to the input neurons and the result is obtained from the output, but no explanation of the reasons for such a solution is provided.
The rules are contained in the weight coefficients, activation functions and connections between neurons, but their structure is usually too complex to understand. Moreover, in a multilayer network these parameters may represent non-linear, non-monotonic relationships between input and target values. In general, it is therefore not possible to isolate the influence of a particular characteristic on the target value, because the effect is mediated by the values of other parameters. In addition, the back-propagation (BP) learning algorithm has difficulties of its own, both with local extrema of the error function and with the solution of certain classes of problems.

2 Choosing the architecture and initial values of the neural network

The choice of a neural network architecture can be based on knowledge of the problem being solved and of the available source data, their dimension and the size of the samples. There are different approaches to choosing the initial values of the neural network characteristics. For example, the "Network Advisor" of the ST Neural Networks package by default proposes, for the multilayer perceptron, one intermediate layer with a number of elements equal to half the sum of the numbers of inputs and outputs. In general, the choice of the number of hidden elements for the multilayer perceptron must balance two opposing requirements: on the one hand, the number of elements should be adequate for the task; on the other hand, it should not be too large, so that the necessary generalization capability is preserved and overfitting is avoided. In addition, the number of hidden units depends on the complexity of the function that the neural network should reproduce, but this function is not known in advance. It should also be noted that as the number of elements increases, the required number of observations increases as well. As an estimate, one can use the principle of joint optimization of the empirical error and the complexity of the model, which takes the following form [1]:

min{description of the error + description of the model}.   (1)

The first term of this expression is responsible for the accuracy of the approximation and for the observed learning error: the smaller it is, the fewer bits are needed to correct the model predictions. If the model predicts the data exactly, the length of the error description is zero. The second term represents the amount of information needed to select a specific model from the set of all possible ones. Taking it into account imposes the necessary constraints on the complexity of the model by suppressing an excessive number of tuning parameters.

The accuracy of the neural network function approximation increases with the number of neurons in the hidden layer: with h hidden neurons the error can be estimated as O(1/h). Since the number of outputs of the network does not exceed, and is typically much smaller than, the number of inputs, the bulk of the weights in a two-layer network is concentrated in the first layer, i.e. w = i · h, where i is the input dimension. In this case the average approximation error, expressed through the total number of weights in the network, is O(i/w).
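As a small illustration (not part of the original paper; the function names and the default rule are assumptions used only for the sketch), the default sizing heuristic and the asymptotic error estimates above can be written as:

import math

def default_hidden_units(n_inputs: int, n_outputs: int) -> int:
    """ST Neural Networks-style default: half the sum of inputs and outputs."""
    return max(1, (n_inputs + n_outputs) // 2)

def approx_error_order(n_inputs: int, n_hidden: int) -> float:
    """Order-of-magnitude estimate of the approximation error.

    With h hidden neurons the error behaves as O(1/h); since the first layer
    holds w = i * h weights, the same estimate reads O(i / w).
    """
    w = n_inputs * n_hidden          # weights concentrated in the first layer
    return n_inputs / w              # equals 1 / n_hidden

if __name__ == "__main__":
    i, o = 12, 2
    h = default_hidden_units(i, o)
    print(f"default hidden units: {h}, error order ~ {approx_error_order(i, h):.3f}")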
The description of the network is associated with the model complexity and essentially comes down to the amount of information required to transmit its weight values through some communication channel. If we accept a hypothesis ψ about the network configuration, its weights and the number of neurons, the amount of information (in the absence of noise) needed to transfer the weights is −log(Prob), where Prob is the probability of this event before the message arrives at the receiver input. For a given accuracy this description requires about −log P(ψ) ∼ w bits. Therefore, the specific error per pattern associated with the complexity of the model can be estimated as ∼ w/p, where p is the number of patterns in the training set. This error decreases monotonically as the number of patterns grows. Haykin, using the results of Baum and Haussler, gives recommendations on the size of the training sample relative to the number of weight coefficients, taking into account the proportion of errors allowed during testing, which can be expressed by the inequality p ≥ w/ε, where ε is the proportion of errors allowed during testing. Thus, when 10% of errors are acceptable, the number of training patterns must be 10 times greater than the number of weight coefficients in the network.

Thus, both components of the network generalization error from expression (1) have been considered. It is important that these components depend differently on the network size (the number of weights), which implies the possibility of choosing an optimal size that minimizes the total error:

Expression (1) ∼ i/w + w/p ≥ 2 · √(i/p).   (2)

The minimum error (the equality sign) is achieved at the optimal number of weights in the network, w ∼ √(p · i), which corresponds to the following number of neurons in the hidden layer:

h = w/i ∼ √(p/i),   (3)

where i and h are the numbers of neurons in the input and hidden layers, w is the number of weights and p is the number of patterns in the training sample.

3 Evaluation of the resulting architecture and parameters

After the network architecture has been selected, the learning process is carried out; for a multilayer perceptron this may be the error back-propagation (BP) algorithm. One of the most serious problems is that the network is trained to minimize the error on the training set rather than the error that can be expected when the network processes completely new patterns. Thus, in the absence of an ideal and infinitely large training sample, the training error will differ from the generalization error of the previously unknown model of the phenomenon [2]. Since the generalization error is defined for data not included in the training set, a solution is to split all the available data into two sets: a training set, which is used to fit specific values of the weights, and a validation set, which evaluates the predictive ability of the network and is used to select the model of optimal complexity. The training process is commonly stopped by inspecting "learning curves", which track the dependence of the learning and generalization errors on the neural network size [3, 4]. The optimum corresponds to local minima and to the points where the graphs approach their asymptotes.
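A minimal sketch of such a stopping rule based on the validation set (an illustration only; train_epoch and validation_error are hypothetical callables supplied by the surrounding training code, not part of the paper):

def train_with_early_stopping(train_epoch, validation_error, max_epochs=1000, patience=20):
    """Stop when the validation error has not improved for `patience` epochs.

    train_epoch():       runs one training epoch and returns the training error
    validation_error():  returns the current error on the held-out validation set
    """
    best_val, best_epoch, history = float("inf"), 0, []
    for epoch in range(max_epochs):
        tr_err = train_epoch()
        val_err = validation_error()
        history.append((epoch, tr_err, val_err))
        if val_err < best_val:
            best_val, best_epoch = val_err, epoch   # candidate stopping point (t* in Fig. 1 below)
        elif epoch - best_epoch >= patience:
            break                                   # validation error has stopped improving
    return best_epoch, history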
Figure 1 shows the stop point, which corresponds to the minimum of the validation error (dash-dotted line), while the training error (solid line) keeps going down.

Fig. 1. The stopping criterion for training at time t*.

Another class of learning curves uses the dependence of internal properties of the neural network on its size, which is then mapped onto the generalization error. For example, in [3] the internal representation of the problem being solved is analyzed together with the relationship between the training error and the maximum sum of absolute synapse weights per neuron of the network. There are also variants of generalized curves based on the dependence of a wave criterion on the neural network size [5], or based on a comparison of the average absolute values of the synapse weights [6]. In a simplified form, the following criteria can be formulated to assess an already constructed neural network model:

– if the training error is small and the testing error is large, the network contains too many synapses;
– if both the training error and the testing error are large, the number of synapses is too small;
– if all the synapse weights are too large, there are too few synapses.

After evaluation of the neural network model, a decision is made about whether the number of hidden elements should be changed in one direction or the other, and the learning process is then repeated. It is worth mentioning that modified decision trees, based on first-order predicate logic, can be applied as a means of decision-making support and of automating the construction of the neural network structure.

4 Rule extraction and result interpretation

Generally speaking, there are two approaches to extracting rules from multilayer neural networks. The first approach is based on the extraction of global rules that characterize the output classes directly through the input parameter values. The alternative is the extraction of local rules, separating the multilayer network into a set of single-layer networks. Each extracted local rule characterizes a separate hidden or output neuron on the basis of its weighted connections with other elements. The rules are then combined into a set that determines the behavior of the whole network. The NeuroRule algorithm is applicable to rule extraction from trained multilayer neural networks such as the perceptron. This algorithm prunes the network and identifies the most important features. However, it imposes quite strict limitations on the architecture, the number of elements and connections, and the type of activation functions. As an alternative, TREPAN-type algorithms can be highlighted, which extract structured knowledge in the form of a decision tree not only from extremely simplified neural networks but also from arbitrary classifiers [7]. However, this approach does not take into account structural features that could introduce additional information. A solution to this kind of problem can be based on the modular neural network BP-SOM [8].

5 The modular neural network BP-SOM

5.1 Network architecture

The main idea is to increase the similarity of the reactions of the hidden elements when processing patterns from the sample that belong to the same class.
The traditional feed-forward architecture [1], in particular the multilayer perceptron with the back-propagation (BP) learning algorithm, is combined with Kohonen self-organizing maps (SOM) [9]: each hidden layer of the perceptron is associated with a certain self-organizing map. The structure of such a network is shown in Figure 2.

Fig. 2. The modular neural network BP-SOM with one hidden layer.

5.2 Learning algorithm

The BP-SOM learning algorithm largely corresponds to a combination of the learning rules of its component parts [8]. First, an input vector from the training sample is fed to the input of the network and a forward pass is carried out. At the same time, the activations of the neurons of each hidden layer are used as the vector of input values for the corresponding SOM. Training of the SOM components is carried out in the usual way and ensures their self-organization. This self-organization is then used to account for the classes, whose tags are assigned to the elements of the Kohonen map. For this purpose a count is kept of the number of times each SOM neuron became the winner, together with the class of the corresponding training vector. The winner is the SOM neuron whose weight vector is closest, in terms of the Euclidean distance, to the vector of output values of the hidden-layer neurons. The most frequent class is taken as the label. The reliability is calculated as the ratio of the number of occurrences of the class label to the total number of wins of the neuron; for example, if a SOM neuron won 4 times for class A and 2 times for class B, the label of class A is selected with reliability 4/6. The total accuracy of the self-organizing map is equal to the average reliability of all elements of the map. The SOM also allows the data to be visualized and displays areas corresponding to the various classes (Fig. 2).

The multilayer-perceptron component is trained by an algorithm similar to back-propagation (BP), minimizing the aggregate squared error

BP_Error = ∑_i (d_i − y_i)²,   (4)

where the index i runs over all outputs of the multilayer network, d_i is the desired output of neuron i, and y_i is the current output of neuron i in the last layer. This error is propagated through the network in the opposite direction, from the output to the hidden layers. In addition, an extra error component SOM_Error is introduced for the neurons of the hidden layers; it is based on information about the class of the input vector and takes the self-organizing map data into account. In the SOM that corresponds to the current hidden layer, a search is made for a special element V_SOM. This element should be the closest, in terms of the Euclidean distance, to the output vector of the hidden layer V_hidden and should carry the same class label as the input vector. The distance between the found vector V_SOM and the vector V_hidden is taken as the error value SOM_Error and is accounted for all neurons of the hidden layer. If no such element V_SOM is found, the error value SOM_Error is taken to be 0.
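A minimal sketch of this SOM-side bookkeeping (illustrative only; the array shapes and function names are assumptions, not the authors' code):

import numpy as np

def som_winner(v_hidden, som_weights):
    """Index of the SOM element whose weight vector is closest to v_hidden."""
    dists = np.linalg.norm(som_weights - v_hidden, axis=1)
    return int(np.argmin(dists))

def label_som(win_counts):
    """Assign to each SOM element its most frequent class and a reliability score.

    win_counts: array (n_elements, n_classes) with the number of times each
    element won for each class during training.
    """
    labels = win_counts.argmax(axis=1)
    totals = win_counts.sum(axis=1)
    reliability = np.where(totals > 0, win_counts.max(axis=1) / np.maximum(totals, 1), 0.0)
    return labels, reliability

def som_error(v_hidden, som_weights, labels, target_class):
    """SOM_Error: distance from v_hidden to the nearest SOM element carrying the
    class label of the current input vector; 0 if no such element exists."""
    mask = labels == target_class
    if not mask.any():
        return 0.0
    dists = np.linalg.norm(som_weights[mask] - v_hidden, axis=1)
    return float(dists.min())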
Thus, the total error for the neurons of the hidden layer takes the form

BPSOM_Error = (1 − α) · BP_Error + r · α · SOM_Error,   (5)

where BP_Error is the perceptron error (from the back-propagation algorithm); SOM_Error is the error of the winner neuron of the Kohonen network; r is the reliability factor of the winner neuron of the Kohonen network; and α is the influence coefficient of the Kohonen network errors (if its value equals 0, the rule reduces to the original BP). The results of the self-organization of the Kohonen maps are thus used to change the weight coefficients during network training. This produces an effect in which the activations of the hidden-layer neurons become more and more similar across all vectors of the same class [10]. A SOM of dimension 7 × 7 is shown in Figure 3.

Fig. 3. Coloring of the hidden-layer Kohonen self-organizing maps, depending on the learning algorithm.

The figure characterizes the reaction of the hidden-layer neurons of a BP-SOM network trained to solve a two-class classification problem. The map on the left corresponds to the basic back-propagation (BP) algorithm, the one on the right to training with the influence of the Kohonen map, i.e. BP-SOM training. Here white cells correspond to class A and black cells to class B. In turn, the size of the shaded region of a cell indicates the reliability of the result, so a completely white or completely black cell is characterized by a reliability of 100%.

This ensures structuring and visualization of the information extracted from the data, improves the perception of the phenomenon under study and helps in the process of selecting the network architecture. For example, if it is impossible to isolate areas of the SOM for the individual classes, there are not enough neurons and their number should be increased. Moreover, this approach can simplify the extraction of rules from an already trained neural network and present the result as a hierarchical structure of consistent "if-then" rules.

5.3 Rule extraction

As an example, let us consider a small test BP-SOM network that is trained to solve the classification problem defined by the following logical function [11]:

F(x0, x1, x2) = (x0 ∧ ¬x1 ∧ ¬x2) ∨ (¬x0 ∧ x1 ∧ ¬x2) ∨ (¬x0 ∧ ¬x1 ∧ x2).

This function is True (Class 1) only when exactly one of the arguments is True; otherwise the function value is False (Class 0). A two-layer neural network can be used for the implementation, consisting of three input elements, three neurons in the hidden layer and two neurons in the output layer. The dimension of the Kohonen map for the intermediate layer is 3 × 3 (Figure 4).

Fig. 4. Kohonen map, class labels and reliability.

After training, four elements of the Kohonen map acquired class labels with reliability 1, while 5 elements were left without a label and their reliability is equal to 0 (Figure 4). One possible method of rule extraction from such a neural network is an algorithm designed for classification problems with binary inputs. It consists of the following two steps [10, 11] (a sketch of both steps is given after this list):

1. Search the training set for groups of patterns that can be combined into individual subsets, each of which is associated with one element of the Kohonen map (Table 1).
2. Examine each of the subsets to identify the inputs that have a constant value within the subset.
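A minimal sketch of these two steps (illustrative only; the helper names and data layout are assumptions):

import numpy as np

def group_by_som_element(patterns, classes, winner_fn):
    """Step 1: split the training patterns into subsets keyed by the winning SOM element.

    patterns:  array (n_patterns, n_inputs)
    classes:   array (n_patterns,)
    winner_fn: maps a pattern to the index of its winning SOM element
               (e.g. via the hidden-layer activations and som_winner above)
    """
    groups = {}
    for x, c in zip(patterns, classes):
        k = winner_fn(x)
        groups.setdefault(k, {"patterns": [], "classes": []})
        groups[k]["patterns"].append(x)
        groups[k]["classes"].append(c)
    return groups

def constant_inputs(subset):
    """Step 2: return {input index: value} for inputs constant over the subset."""
    arr = np.asarray(subset)
    return {j: arr[0, j] for j in range(arr.shape[1]) if np.all(arr[:, j] == arr[0, j])}

For the subgroup attached to element k1 in the example below, constant_inputs would return {0: 0, 1: 0, 2: 0}, matching the first rule derived from Table 1.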
For example, in the subgroup associated with the element k1 all the attributes x0, x1 and x2 have the constant value 0, and for the element k3 the constant value 1.

Table 1. Splitting the training set into groups attached to specific elements of the SOM.

SOM element | Class | x0 x1 x2
k1          |   0   |  0  0  0
k3          |   0   |  1  1  1
k7          |   1   |  0  0  1
            |       |  0  1  0
            |       |  1  0  0
k9          |   0   |  0  1  1
            |       |  1  0  1
            |       |  1  1  0

Thus, it is possible to obtain the following two rules from Table 1:

– IF (x0 = 0 ∧ x1 = 0 ∧ x2 = 0) THEN (Class = 0);
– IF (x0 = 1 ∧ x1 = 1 ∧ x2 = 1) THEN (Class = 0).

However, it is rather problematic to derive rules for the elements k7 and k9 in this way. Of course, it is possible to compose a disjunction of all the possible options for each SOM element, but such rules would be hard to comprehend. Additional available information is the sum of the attributes of each pattern in the training sample. It is easy to notice that each SOM element is responsible for a certain value of this sum (Table 2).

Table 2. The sum of the attribute values of the patterns from the training sample associated with each SOM element.

        k1  k3  k7  k9
Sum      0   3   1   2
Class    0   0   1   0

These considerations can be generalized. To extract rules that correspond to the elements of the Kohonen map, we need to find constraints on the attribute values of the input vectors and on their weight coefficients. This can be done by propagating the minimum and maximum values of the neuron activation back to the previous layer, i.e. by applying the function f⁻¹ (the inverse of the activation function) to the output value of the neuron [12]:

f⁻¹(V_i^Cur) = f⁻¹(f(∑_j w_ji · V_j^Prev + bias_i)) = ∑_j w_ji · V_j^Prev + bias_i,   (6)

where V_i^Cur is the output of the i-th neuron of the current layer and V_j^Prev is the output of the j-th neuron of the previous layer. Assuming that the sigmoid is used as the neuron activation function, we obtain

f⁻¹(V_i^Cur) = −ln(1 / V_i^Cur − 1).   (7)

In addition, it is known that the elements of the self-organizing map connected to the elements of the first hidden layer of the perceptron respond to the proximity of their weight vectors to the outputs of the hidden-layer neurons. Therefore, when constructing rules for each SOM element, it is proposed to replace the back-propagated activations of the hidden layer with the values of the weight vector of the corresponding element of the self-organizing map. For example, if the hidden layer contains a neuron A, and the weight between this neuron and SOM element k is denoted by w_A^k, then the restriction takes the form

f⁻¹(w_A^k) − bias_A ≈ ∑_j w_jA · x_j   or   1 ≈ (∑_j w_jA · x_j) / (f⁻¹(w_A^k) − bias_A).   (8)

Such restrictions, obtained for all the neurons of the hidden layer, can be used to construct rules of the form

IF (∧_A (f⁻¹(w_A^k) − bias_A ≈ ∑_j w_jA · x_j)) THEN (Class = Class(k)),   (9)

where the conjunction runs over the hidden-layer neurons A. Similar rules are constructed for all SOM elements whose reliability exceeds a certain threshold. Applying this method to the initial example yields four sets of restrictions, each containing three restrictions, which corresponds to the number of neurons in the hidden layer of the perceptron. For SOM element k1:

– 1 ≈ 687 · x0 + 687 · x1 + 687 · x2;
– 1 ≈ 738 · x0 + 738 · x1 + 738 · x2;
– 1 ≈ 1062 · x0 + 1062 · x1 + 1062 · x2.
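Before interpreting these restrictions, the mechanics of equations (7)-(8) can be sketched as follows (an illustration under the assumption of sigmoid activations; the parameter names are hypothetical):

import numpy as np

def inverse_sigmoid(v):
    """f⁻¹ for the logistic sigmoid, as in equation (7); requires 0 < v < 1."""
    return -np.log(1.0 / v - 1.0)

def som_element_restrictions(w_hidden, bias_hidden, w_som_k):
    """Build one linear restriction per hidden neuron for a given SOM element k,
    following equation (8): 1 ≈ (∑_j w_jA · x_j) / (f⁻¹(w_A^k) − bias_A).

    w_hidden:    array (n_hidden, n_inputs), input-to-hidden weights w_jA
    bias_hidden: array (n_hidden,), hidden biases bias_A
    w_som_k:     array (n_hidden,), weight vector of SOM element k
                 (one component w_A^k per hidden neuron A)

    Returns an array (n_hidden, n_inputs) of normalized coefficients c_Aj such
    that each restriction reads 1 ≈ ∑_j c_Aj · x_j.
    """
    rhs = inverse_sigmoid(w_som_k) - bias_hidden        # f⁻¹(w_A^k) − bias_A
    return w_hidden / rhs[:, None]                      # divide each row by its right-hand side

Rows of coefficients that are (approximately) equal across inputs, as for k1 above, suggest rules of the simple form 1 ≈ c · (x0 + x1 + x2).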
Returning to the restrictions for k1: taking into account that x0, x1, x2 ∈ {0, 1}, the best match is achieved when x0 = 0, x1 = 0 and x2 = 0. For element k3 all the restrictions coincide and have the form

– 1 ≈ 0.33 · x0 + 0.33 · x1 + 0.33 · x2,

so the values are x0 = 1, x1 = 1, x2 = 1. For element k7 the restrictions coincide and take the form

– 1 ≈ x0 + x1 + x2,

which corresponds to the case when exactly one of the attributes is equal to 1. The restrictions for element k9 are

– 1 ≈ 0.5 · x0 + 0.5 · x1 + 0.5 · x2,

and this condition characterizes the case when two of the three attributes are equal to 1. Thus, generalizing all the restrictions, the following set of rules is obtained:

– IF (x0 + x1 + x2 ≈ 0) THEN (Class = 0);
– IF (x0 + x1 + x2 ≈ 3) THEN (Class = 0);
– IF (x0 + x1 + x2 ≈ 1) THEN (Class = 1);
– IF (x0 + x1 + x2 ≈ 2) THEN (Class = 0).

It is easy to see that this set correctly describes all the elements of the self-organizing network.

6 Conclusion

Approaches to the selection of the initial values of neural network parameters were considered, and the training process, its stopping criteria and the evaluation of the resulting architecture were analyzed. Combining different neural network architectures, such as the multilayer perceptron with the back-propagation learning algorithm and Kohonen's self-organizing maps, can bring additional possibilities both to the learning process and to the extraction of rules from a trained neural network. The self-organizing maps are used both for information visualization and for influencing the weight changes during network training. This produces an effect in which the activations of the hidden-layer neurons become increasingly similar across all vectors of the same class. It ensures the structuring of the extracted information, with the main purpose of improving the perception of the studied phenomena, assisting in the process of selecting the network architecture and simplifying rule extraction. The results can be used for data processing and for the identification of hidden patterns in information storage, which can become the basis for prognostic and design solutions.

Acknowledgements. This work was carried out with the financial support of the RFBR (grant 15-07-01117-a).

References

1. Ezhov A. A., Shumskiy S. A. Neyrokomp'yuting i ego primeneniya v ekonomike i biznese. Moscow, MEPhI Publ., 1998. (in Russian)
2. Bishop C. M. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
3. Watanabe E., Shimizu H. Relationships between internal representation and generalization ability in multi layered neural network for binary pattern classification problem. Proc. IJCNN 1993, Nagoya, Japan, 1993. Vol. 2, pp. 1736-1739.
4. Cortes C., Jackel L. D., Solla S. A., Vapnik V., Denker J. S. Learning curves: asymptotic values and rate of convergence. Advances in Neural Information Processing Systems 7 (1994). MIT Press, 1995. pp. 327-334.
5. Lar'ko A. A. Optimizaciya razmera nejroseti obratnogo rasprostraneniya. [Electronic resource]. http://www.sciteclibrary.ru/rus/catalog/pages/8621.html.
6. Caregorodcev V. G. Opredelenie optimal'nogo razmera nejroseti obratnogo rasprostraneniya cherez sopostavlenie srednih znachenij modulej vesov sinapsov. Materialy 14 mezhdunarodnoj konferencii po nejrokibernetike, Rostov-na-Donu, 2005. T. 2, S. 60-64. (in Russian)
7. Gridin V. N., Solodovnikov V. I., Evdokimov I. A., Filippkov S. V. Postroenie derev'ev reshenij i izvlechenie pravil iz obuchennyh nejronnyh setej. Iskusstvennyj intellekt i prinyatie reshenij, 2013, No. 4, pp. 26-33. (in Russian)
8. Weijters A. The BP-SOM architecture and learning rule. Neural Processing Letters, 2, 13-16, 1995.
9. Kohonen T. Self-Organization and Associative Memory. Berlin: Springer Verlag, 1989.
10. Weijters T., van den Bosch A., van den Herik J. Interpretable neural networks with BP-SOM. Machine Learning: ECML-98, Lecture Notes in Computer Science, Vol. 1398, 1998, pp. 406-411.
11. Eggermont J. Rule-extraction and learning in the BP-SOM architecture. Thesis, 1998.
12. Thrun S. B. Extracting provably correct rules from artificial neural networks. Technical Report IAI-TR-93-5, University of Bonn, Department of Computer Science, 1993.