Coordinator Synthesis for Hierarchical Structure of Artificial Neural Network

Stanislaw Placzek
Engineering Department, Vistula University, Stoklosy Street, 02-787 Warsaw, Poland
stanislaw.placzek@wp.pl

Abstract. Two important issues have to be dealt with when implementing the hierarchical structure [1] of the learning algorithm of an Artificial Neural Network (ANN). The first one concerns the selection of the general coordination principle. Three different principles are described; they vary with regard to the degree of freedom given to the first-level tasks. The second issue concerns the coordinator structure, or coordination algorithm. The ANN learning process can be examined as a two-level optimization problem. Importantly, all problems and sub-problems are unconstrained minimization tasks. The article concentrates on the issue of the coordinator structure. Using the interaction prediction principle as the most suitable principle for two-level ANN structures, different coordinator target functions are defined. Using classification task examples, the main dynamic characteristics of the learning process quality are shown and analyzed.

Keywords: Artificial Neural Network (ANN), hierarchy, decomposition, coordination, coordination principle, coordinator structure

1 Computational task complexity

Large-scale multidimensional classification, interpolation and extrapolation are complex tasks that can require long calculation times. For these tasks one can make use of the ANN learning process using input and output data vectors with a defined network architecture. There is no theoretical solution to the architecture selection process, including the definition of the number of hidden layers and the distribution of neurons between layers. From a calculation point of view, the ANN learning process approaches a local or a global minimum asymptotically and is very time-consuming.

A multi-layered ANN with one input layer, a set of hidden layers and an output layer can be sectioned off [1]. Every layer has its own input and output vectors. For a standard two-layer network, both the hidden layer and the output layer can be described as sub-networks. These sub-networks form the first level of the hierarchical structure. The network thus consists of two sub-networks, and a local target function is defined for each of them, Φ = (Φ1, Φ2). Similarly to the ANN structure decomposition, the learning algorithm based on error backpropagation can also be decomposed into:

– the first-level task, searching for the minimum of the local target functions Φ = (Φ1, Φ2),
– the second-level task, coordinating the first-level tasks.

Unfortunately, the first-level optimization tasks in such a learning algorithm are non-linear. In practice, only standard procedures exist to solve these optimization problems. In a two-level learning algorithm structure, however, the coordinator is not responsible for solving the global task. In [2], the most popular interaction prediction principle was implemented. According to this principle, the coordinator plays an active role in the current ANN learning process. The main interaction prediction principle is shown in Fig. 1.

Fig. 1. Interaction Prediction Principle

With each iteration, the coordinator and all of the first-level sub-tasks interchange information. The first-level sub-tasks are optimum searching tasks; usually, they look for the minimum of the Mean Squared Error (MSE).
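This information exchange can be summarized as a simple iteration scheme. The following Python sketch is purely schematic: local_step_1, local_step_2 and coordinator_step are hypothetical placeholders standing in for the first-level optimization procedures and the coordinator update rule that are defined in the following sections.

    def coordinate(u_gamma, local_step_1, local_step_2, coordinator_step,
                   n_iterations=100):
        """Iterate the exchange between the coordinator and the two sub-tasks.

        u_gamma          -- initial prediction of the interface (coordination) signal
        local_step_1/2   -- callables: given u_gamma, improve the sub-network
                            weights and return the feedback signal (eps1, eps2)
        coordinator_step -- callable: given u_gamma and both feedback signals,
                            return an improved coordination signal
        """
        for _ in range(n_iterations):
            eps1 = local_step_1(u_gamma)   # first-level task 1: minimize Phi1
            eps2 = local_step_2(u_gamma)   # first-level task 2: minimize Phi2
            u_gamma = coordinator_step(u_gamma, eps1, eps2)  # second-level task
        return u_gamma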
The coordination algorithm structure is primarily related to the interaction prediction principle, and two signals are used: the primary (descending) coordination signal U^γ = (U1^γ, U2^γ) and the ascending feedback signal ε = (ε1, ε2). The primary signals are known as coordination signals and are sent from the coordinator to all of the first-level sub-tasks. Thereby the coordinator assists in optimizing the first-level sub-tasks. The forecast coordination signals may not be precisely correct, and the sub-tasks calculate their own values of the coordination signals. These signals are sent up to the coordinator. The coordinator then uses its own gradient method to calculate the new value of the coordination signals (improving upon its own previous estimate). The process is continued until the coordinator and all the first-level sub-tasks have solved their own tasks.

1.1 Coordination aspects

Coordination as an operation process is related to three types of decision problems [3]. A solution has to be found for:

– the global task, as the primary task for the entire ANN structure,
– the local tasks, resulting from the decomposition of the global target function,
– the coordinator task, for which a structure or a solution procedure should be synthesized.

In effect, three different specific problems have to be solved:

1. Coordinator Synthesis. Decompose the global target function Φ into two sub-tasks with their own local target functions Φ1, Φ2; the upper-level coordinator task should be defined in such a way that all the sub-tasks are coordinated.

2. Modification Problem. In a two-level ANN learning algorithm, both the first-level tasks and the coordinator task are defined. Unfortunately, the ANN is not coordinated by the coordinator task. Before this problem can be dealt with, one should modify the first-level tasks in such a way that the coordinator coordinates the modified local target functions Φ = (Φ1, Φ2). This can be formalized using the following predicate formula:

(\exists U^{\gamma})(\exists W)\,[\,P(W, \Phi) \;\mathrm{and}\; P(U^{\gamma}, \Psi(U^{\gamma}, \varepsilon))\,] \qquad (1)

Where: the predicate P(W, Φ) is true if Φ is a problem (task) and W is one of its solutions; Ψ(U^γ, ε) is the coordinator target function. The first-level tasks are coordinated with the coordination task when, whenever the coordination task has a solution, all the first-level sub-tasks also have a solution for the same coordination input U^γ.

3. Decomposition. Given only the global target task Φ for the ANN, decompose this task Φ into the two sub-tasks Φ1, Φ2 and find a coordinator structure and a coordination procedure. This is formalized as:

(\exists U^{\gamma})(\exists W)\,[\,P(W = (W1, W2), (\Phi_1(U^{\gamma}), \Phi_2(U^{\gamma}))) \;\mathrm{and}\; P(W, \Psi(X, Z, W))\,] \qquad (2)

Where: X - the input data set, Z - the learning (teaching) data set, W = (W1, W2) - the ANN's weight coefficients. The first-level tasks are coordinated with a given global target task when the global task has a solution and so do the first-level tasks. The coordinator has to influence the first-level tasks in such a way that the resulting action guarantees the solution of the global target task.

1.2 Decomposition and coordination

Using Fig. 1, one can define the set of target functions:

– The global target function:

\Phi(W1, W2, Y, Z) = \frac{1}{2}\sum_{k=1}^{N_2}(y_k - z_k)^2 \qquad (3)

Where: Y[1:N2] - the ANN output vector, Z[1:N2] - the vector of teaching data, N2 - the number of output neurons.

– The local target function Φ1:

\Phi_1(W1, X, U^{\gamma}) = \frac{1}{2}\sum_{i=1}^{N_1}\left(f\Big(\sum_{j=0}^{N_0} W1_{ij}\, x_j\Big) - u_i^{\gamma}\right)^2 \qquad (4)

Where: U^γ[1:N1] - the coordination signal treated as an input variable, N1 - the number of hidden neurons, N0 - the number of input neurons, f(·) - a sigmoid function.

– The local target function Φ2:

\Phi_2(W2, Z, U^{\gamma}) = \frac{1}{2}\sum_{k=1}^{N_2}\left(f\Big(\sum_{i=0}^{N_1} W2_{ki}\, u_i^{\gamma}\Big) - z_k\right)^2 \qquad (5)

Where: U^γ[1:N1] - the coordination signal treated as an input variable, N2 - the number of output neurons, f(·) - a sigmoid function.
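For concreteness, the target functions (3)-(5) can be written down directly. The following NumPy sketch is illustrative only; the bias terms (index 0 in the sums) are omitted for brevity, and the function and variable names are chosen here, not taken from the original implementation.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    # W1: (N1, N0) hidden-layer weights, W2: (N2, N1) output-layer weights,
    # x: input vector, z: teaching vector, u_gamma: coordination signal (length N1).

    def phi_global(W1, W2, x, z):
        """Global target function (3): MSE of the whole two-layer network."""
        y = sigmoid(W2 @ sigmoid(W1 @ x))
        return 0.5 * np.sum((y - z) ** 2)

    def phi1(W1, x, u_gamma):
        """Local target function (4): hidden sub-network output vs. coordination signal."""
        v1 = sigmoid(W1 @ x)
        return 0.5 * np.sum((v1 - u_gamma) ** 2)

    def phi2(W2, z, u_gamma):
        """Local target function (5): output sub-network driven by the coordination signal."""
        v2 = sigmoid(W2 @ u_gamma)
        return 0.5 * np.sum((v2 - z) ** 2)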
Using (4), one can calculate the feedback signal ε1i and the new value of matrix W1:

\varepsilon_{1i} = f\Big(\sum_{j=0}^{N_0} W1_{ij}\, x_j\Big) \qquad (6)

\frac{\partial \Phi_1}{\partial W1_{ij}} = (v1_i - u_i^{\gamma}) \cdot f' \cdot x_j \qquad (7)

W1_{ij}(n+1) = W1_{ij}(n) - \alpha_1 \cdot \frac{\partial \Phi_1}{\partial W1_{ij}} \qquad (8)

For the second sub-network, using (5), one can calculate the new value of ε2i and the new value of matrix W2ki:

\frac{\partial \Phi_2}{\partial W2_{ki}} = (v2_k - z_k) \cdot f' \cdot u_i^{\gamma} \qquad (9)

W2_{ki}(n+1) = W2_{ki}(n) - \alpha_2 \cdot \frac{\partial \Phi_2}{\partial W2_{ki}} \qquad (10)

\frac{\partial \Phi_2}{\partial u_i^{\gamma}} = \sum_{k=1}^{N_2} (v2_k - z_k) \cdot f' \cdot W2_{ki} \qquad (11)

\varepsilon_{2i}(n+1) = u_i^{\gamma}(n) - \alpha_3 \cdot \frac{\partial \Phi_2}{\partial u_i^{\gamma}} \qquad (12)

Where: α1, α2, α3 - learning coefficients, u_i^γ = u1_i^γ = u2_i^γ - coordination signals, ε1i, ε2i - feedback signals.

To summarize, the first level includes two sub-networks and two optimization tasks. The first sub-network takes the parameter u1_i^γ from the coordinator and, using its optimization procedure, calculates the new coefficient matrix W1_{ij}(n+1) and the value of the feedback signal ε1i. The feedback signal is sent to the coordinator. For the second sub-network, the coordination signal u2_i^γ sets the optimization procedure in motion, and the new coefficient matrix W2_{ki}(n+1) and the feedback signal ε2i are calculated. This feedback signal is also sent to the coordinator. After that, the coordinator procedure has to calculate the new coordination signals u1_i^γ(n+1) and u2_i^γ(n+1). Thus, the design of the coordinator structure is the main problem in a two-level learning algorithm.
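A single first-level iteration, following equations (6)-(12), may be sketched as below. This is an illustrative fragment (names chosen here, bias terms omitted, no loop over the learning set), not the original implementation.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def first_level_step(W1, W2, x, z, u_gamma, alpha1=0.3, alpha2=0.3, alpha3=0.3):
        """One gradient step of both first-level sub-tasks, eqs. (6)-(12)."""
        # First sub-network: hidden output and feedback signal eps1, eq. (6)
        v1 = sigmoid(W1 @ x)
        eps1 = v1
        # Gradient of Phi1 w.r.t. W1, eq. (7), with f' = f(1 - f) for the sigmoid,
        # followed by the weight update, eq. (8)
        delta1 = (v1 - u_gamma) * v1 * (1.0 - v1)
        W1 = W1 - alpha1 * np.outer(delta1, x)

        # Second sub-network, driven by the coordination signal
        v2 = sigmoid(W2 @ u_gamma)
        delta2 = (v2 - z) * v2 * (1.0 - v2)
        # Gradient of Phi2 w.r.t. the coordination signal, eq. (11)
        d_phi2_du = W2.T @ delta2
        # Gradient of Phi2 w.r.t. W2, eq. (9), and the weight update, eq. (10)
        W2 = W2 - alpha2 * np.outer(delta2, u_gamma)
        # Feedback signal of the second sub-network, eq. (12)
        eps2 = u_gamma - alpha3 * d_phi2_du

        return W1, W2, eps1, eps2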
2 Coordinator structure

In a two-level learning algorithm, the coordinator plays the main role. Therefore, the choice of the coordination principle is paramount. The principle should specify the strategy for the coordinator and determine the structure of the coordinator. Three different approaches to how the interaction could be performed were introduced in [3]. For an ANN learning algorithm, the Interaction Prediction Principle is the most suitable. According to it, the coordinator predicts the interface inputs. Success in coordinating all the first-level tasks depends on the accuracy of the prediction of the interface inputs or of the effect of the predicted inputs. Therefore, to obtain responses from the first-level tasks, the coordinator could measure:

1. The interface input values. This principle will be known as the Interface Interaction Balance.
2. The target function values. This principle will be known as the Performance Prediction Principle.

2.1 Interface Interaction Balance

The coordination input may involve a prediction of the interface between the first and the second sub-networks. For the first sub-network, the coordinator signal U1^γ is sent into the target function Φ1 as a teaching data parameter. For the second sub-network, the coordinator signal U2^γ is treated as an input variable. Thanks to this, both tasks are fully specified and the algorithms can find the minimum values of their target functions Φ1, Φ2. The Interface Interaction Balance idea is shown in Fig. 2.

Fig. 2. Coordination algorithm structure for Interface Prediction Principle

The coordinator target function Ψ could be a linear or a nonlinear function:

\Psi(U^{\gamma}, \varepsilon) = \Psi(U^{\gamma}, U^{\gamma} - Y) \qquad (13)

Equation (13) uses the linear relation between the two feedback signals ε1, ε2 and the predicted interface value U^γ.

Where: U^γ - predicted interface value, Y - real interface value.

\Psi = \frac{1}{2}\sum_{i=1}^{N_1}(u_i^{\gamma} - \varepsilon_{1i})^2 + \frac{1}{2}\sum_{i=1}^{N_1}(u_i^{\gamma} - \varepsilon_{2i})^2 \qquad (14)

Where: U^γ = [u_1^γ, u_2^γ, ..., u_{N_1}^γ].

Using formula (14), the first derivative is calculated:

\frac{\partial \Psi}{\partial u_i^{\gamma}} = (u_i^{\gamma} - \varepsilon_{1i}) + (u_i^{\gamma} - \varepsilon_{2i}) \qquad (15)

u_i^{\gamma}(n+1) = u_i^{\gamma}(n) - \lambda_1 \cdot \frac{\partial \Psi}{\partial u_i^{\gamma}} \qquad (16)

Where: λ1 - learning coefficient.

The coordinator and the first-level sub-networks work in an iterative scheme. When the coordination signal U^γ(n) is applied and the first-level optimization tasks find their own solutions, a new coordination signal can be calculated from (16). Under the Interface Interaction Balance, two vector signals are measured: the predicted interface input U^γ and the real interface values Y1, Y2 (Fig. 2). In a real situation this requirement could be difficult to implement. It is usually possible to measure the first-level tasks and to send to the coordinator the target function (performance) values Φ1, Φ2 and their derivatives.

2.2 Linear Performance Balance Principle

The target function values and their derivatives are of a more generalized form. The coordination scheme that uses this idea is shown in Fig. 3.

Fig. 3. Coordination algorithm structure for Performance Prediction Principle

The coordinator can receive both the target function values and the derivative values from the first-level sub-systems. The coordinator function can be defined as a linear or a nonlinear function of the two parameters Φ1, Φ2:

\Psi = \Psi(\Phi_1, \Phi_2) \qquad (17)

Then the local target functions can be defined by:

\Phi_1(Y1, W1, U^{\gamma}) = \frac{1}{2}\sum_{i=1}^{N_1}(y1_i - u_i^{\gamma})^2 \qquad (18)

\Phi_2(Y2, W2, Z) = \frac{1}{2}\sum_{k=1}^{N_2}(y2_k - z_k)^2 \qquad (19)

In this subsection, we examine the coordinator function (17) as a linear relation:

\Psi = \Phi_1 + \Phi_2 \qquad (20)

Using (18) and (19), the first partial derivatives are calculated:

\frac{\partial \Phi_1}{\partial u_i^{\gamma}} = (y1_i - u_i^{\gamma}) \qquad (21)

\frac{\partial \Phi_2}{\partial u_i^{\gamma}} = \sum_{k=1}^{N_2}(y2_k - z_k) \cdot f' \cdot W2_{ki} \qquad (22)

Using the same gradient algorithm for the coordinator as above, the new coordination signal U^γ(n+1) is calculated from the formula:

u_i^{\gamma}(n+1) = u_i^{\gamma}(n) - \lambda_1 \cdot \frac{\partial \Psi}{\partial u_i^{\gamma}} \qquad (23)

The global target function Ψ is a nonlinear function that should take into account the sigmoid activation functions of both the first and the second sub-network. Because of this, one can say that the coordination function described by formula (20) is simplistic and does not consider the non-linearity of the coordinator structure.

2.3 Nonlinear Performance Balance Principle

In this article, non-linearity will be introduced through two coordinator target functions:

\Psi_m = \Phi_1 + \Phi_2 + c \cdot \Phi_1 \cdot \Phi_2 \qquad (24)

\Psi_p = \Phi_1 + \Phi_2 + c \cdot (\Phi_1^2 + \Phi_2^2) \qquad (25)

Where: Ψm - coordinator function with a nonlinear multiplicative part, Ψp - coordinator function with a nonlinear sum-of-squares part. The parameter c describes the impact of the nonlinear part on the coordinator algorithm. The target functions Φ1, Φ2 have the same structure as in formulas (18) and (19), respectively. The final formulas for the derivatives have a more complicated structure:

\frac{\partial \Psi_m}{\partial u_i^{\gamma}} = \frac{\partial \Phi_1}{\partial u_i^{\gamma}} + \frac{\partial \Phi_2}{\partial u_i^{\gamma}} + c \cdot \left( \frac{\partial \Phi_1}{\partial u_i^{\gamma}} \cdot \Phi_2 + \frac{\partial \Phi_2}{\partial u_i^{\gamma}} \cdot \Phi_1 \right) \qquad (26)

\frac{\partial \Psi_p}{\partial u_i^{\gamma}} = \frac{\partial \Phi_1}{\partial u_i^{\gamma}} + \frac{\partial \Phi_2}{\partial u_i^{\gamma}} + 2c \cdot \left( \frac{\partial \Phi_1}{\partial u_i^{\gamma}} \cdot \Phi_1 + \frac{\partial \Phi_2}{\partial u_i^{\gamma}} \cdot \Phi_2 \right) \qquad (27)

In practice, the developer has to find the optimal value of c. This parameter has a significant impact on the quality and stability of the coordinator learning process.
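Assuming the first-level tasks report the values Φ1, Φ2 and their derivatives with respect to u^γ, the linear and nonlinear coordinator updates (20)-(27), together with (23), reduce to a few lines. The sketch below is illustrative only: the derivative of Φ1 is written directly from definition (18), bias terms are omitted, and the names are chosen here, not taken from the original implementation.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def coordinator_step(u_gamma, W1, W2, x, z, lam=0.2, c=0.0, mode="linear"):
        """One coordinator update for the Performance Balance Principle.

        mode = "linear" -> Psi   = Phi1 + Phi2                        (20)
        mode = "mult"   -> Psi_m = Phi1 + Phi2 + c*Phi1*Phi2          (24), (26)
        mode = "power"  -> Psi_p = Phi1 + Phi2 + c*(Phi1^2 + Phi2^2)  (25), (27)
        """
        y1 = sigmoid(W1 @ x)        # first sub-network output
        y2 = sigmoid(W2 @ u_gamma)  # second sub-network output

        phi1 = 0.5 * np.sum((y1 - u_gamma) ** 2)         # (18)
        phi2 = 0.5 * np.sum((y2 - z) ** 2)               # (19)
        d_phi1 = u_gamma - y1                            # derivative of (18) w.r.t. u
        d_phi2 = W2.T @ ((y2 - z) * y2 * (1.0 - y2))     # (22), with f' = f(1 - f)

        if mode == "linear":
            d_psi = d_phi1 + d_phi2                                            # (20)
        elif mode == "mult":
            d_psi = d_phi1 + d_phi2 + c * (d_phi1 * phi2 + d_phi2 * phi1)      # (26)
        else:  # "power"
            d_psi = d_phi1 + d_phi2 + 2 * c * (d_phi1 * phi1 + d_phi2 * phi2)  # (27)

        return u_gamma - lam * d_psi                     # (23)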
3 Classification task example

The main dynamic characteristics of the learning process can be shown using the following example. Emphasis is put on the characteristics of the first-level local target functions Φ1, Φ2 and of the second-level coordinator target function Ψ.

Optimal ANN learning characteristics depend on two connected tasks. Firstly, a network structure with a number of hidden layers needs to be chosen. Secondly, neurons need to be distributed between the layers. A single hidden layer is chosen in this example. The structure of the ANN is simple and can be described as ANN(N0 - N1 - N2). In the literature, one can only find suggestions regarding the optimal number of neurons, based on the Vapnik-Chervonenkis dimension [4][5]. However, the hidden layer structure could also be determined using Kolmogorov's theorem [4][6]. For an ANN with one hidden layer, a sigmoid activation function chosen for classification tasks and N0 input neurons,

VC_{dim} = N_0 + 1 \qquad (28)

Therefore, this measure can be used to define the number of neurons in the hidden layer:

N_1 = VC_{dim} \qquad (29)

For a continuous function with an N0-dimensional input vector and an N2-dimensional output vector, the number of hidden neurons according to Kolmogorov's theorem can be calculated as

N_1 = 2 \cdot N_0 + 1 \qquad (30)

In practice, the number of neurons in the hidden layer is chosen according to the formula

N_0 + 1 \leq N_1 \leq 2 \cdot N_0 + 1 \qquad (31)

Sigmoid activation functions are implemented in both the hidden and output layers. For the classification task using 6-dimensional input vectors, N0 = 6 and N2 = 1, the number of hidden neurons is calculated according to the formula

7 \leq N_1 \leq 13 \qquad (32)

3.1 Example for Interface Interaction Balance

The quality of the learning process depends on the key learning parameters at the first level as well as at the coordinator level. Fig. 4 shows the dynamic characteristics of the learning process.

Fig. 4. The first sub-network's learning error, λ1 = 0.5
Fig. 5. The coordinator learning error, λ = 0.2

The main learning parameters, α1 = 0.3 and α2 = 0.3, guarantee that every sub-task can find the minimum value of its own target function. The coordinator has its own algorithm, described by equation (23). If the parameter λ1 is too large, the learning process is not stable, especially at the end of the iterative process. The first sub-network includes 6 input neurons and 13 output neurons. Matrix W1 includes 7*13 = 91 weight coefficients, but matrix W2 only 14*1 = 14. At each stage the first sub-network achieves a lower target function value Φ1 than the second sub-network. In the middle and the final part of the iterative process, the characteristics are the same. This means that the sub-networks achieve only small corrections to their weight coefficients and the standard gradient algorithm is not efficient (Fig. 6).

Fig. 6. Learning error with regard to the iteration number, learning parameter λ1 = 0.2

3.2 Linear Performance Balance Principle

As stated above, the performance prediction principle is more general than interface prediction. The coordinator does not receive all of the elements of the output vectors Y1, Y2, as in the previous algorithm, but only more general information, such as the function values Φ1, Φ2 and their derivatives. In Fig. 6 the quality of the learning process is shown. The error functions decrease quite fast and no oscillation occurs. In Fig. 7 the dynamic characteristics of the interface values u1, u5, u11 are shown; some of them change their values dynamically.

Fig. 7. Dynamic characteristics of the hidden interface values
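Before turning to the nonlinear case, note that the hidden-layer bounds used in this example follow directly from formula (31). A minimal helper of the following kind (the function name is ours, a sketch only) reproduces the range 7 ≤ N1 ≤ 13 obtained above for N0 = 6.

    def hidden_layer_range(n_inputs):
        """Admissible number of hidden neurons per formula (31):
        N0 + 1 <= N1 <= 2*N0 + 1, i.e. the lower bound from the VC dimension
        (28)-(29) and the upper bound from Kolmogorov's theorem (30)."""
        return n_inputs + 1, 2 * n_inputs + 1

    # For the 6-dimensional classification task above:
    low, high = hidden_layer_range(6)
    print(low, high)  # 7 13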
3.3 Nonlinear Performance Balance Principle

The sub-network transfer function is nonlinear. The coordinator structure should take this into account and use a more complicated target function structure. According to formulas (24)-(27), the function Ψ includes a nonlinear part: the product of the target functions or the sum of their squares. In Fig. 8 the quality of the learning process is shown. A flex point in all of the sub-network characteristics is typical of this learning process; after it, the learning process approaches the minimum value asymptotically. The learning process quality is different for λ1 = 0.2: it is stable and smooth (Fig. 9). The coordinator algorithm requires two parameters: the learning coefficient λ1 and the parameter c. The learning process quality depends on the value of c. Fig. 10 shows this characteristic, including the iteration number at the flex point and the total iteration number.

Fig. 8. The iteration number for the nonlinear coordination function, λ1 = 0.3
Fig. 9. The iteration number for the nonlinear coordination function, λ1 = 0.2

4 Conclusion

In this article, only a single principle was examined. The interaction balance principle can be realized by measuring two different feedback signals: the sub-network output signals Y1, Y2 or the sub-network performance values Φ1, Φ2. In the first case, the learning process is very sensitive to the coordinator learning parameter λ. For λ = 0.3, the last learning stage includes oscillations, which delay convergence. For λ = 0.2 the characteristics are better. Learning processes are not simple and depend not only on the ANN structure but on the input data structure as well. When sub-network performance is measured, the coordinator receives more general information. This has a positive impact on all the dynamic characteristics (Fig. 6); in particular, the oscillations seen in the middle of the learning process are smaller. From the coordinator's point of view, every sub-network can be seen as a black box with the input signal U^γ and output Y1 or Y2. The response functions have to be nonlinear because the global target function is also nonlinear. A coordinator using a stable coordination algorithm with constant learning parameters cannot ensure the optimal interaction vector U^γ during the entire learning process (from beginning to end), and this has a negative impact on the quality and speed of convergence. Using the hierarchical principle, an additional level can be added: the adaptation level. The adaptation level, using its own identification algorithm, can predict learning parameters such as λ and c. This theme should be studied in the future. Finally, the nonlinear coordinator algorithm was examined. In that case all algorithms are very sensitive to the parameter c, which determines the impact of the nonlinear part of the coordinator algorithm on convergence. For a small value of c, the learning time is very long, but for a large value, oscillations appear. The characteristic n = f(c), where n is the iteration number, is shown in Fig. 10.

Fig. 10. Optimal parameters for the Nonlinear Performance Balance Principle

References

1. Placzek S.: A two-level on-line learning algorithm of Artificial Neural Network with forward connections. IJARAI, vol. 3, no. 12, 2014.
2. Placzek S.: Decomposition and the principle of interaction prediction in hierarchical structure of learning algorithm of ANN. Poznan University of Technology Academic Journals, Electrical Engineering, no. 84, Poznan 2015.
3. Mesarovic M.D., Macko D., Takahara Y.: Theory of Hierarchical, Multilevel Systems. Academic Press, New York and London 1970.
4. Haykin S.: Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company, New York 1994.
5. Vapnik V.: Statistical Learning Theory. Wiley, New York 1998.
6. Osowski S.: Sieci neuronowe w ujeciu algorytmicznym [Neural Networks: An Algorithmic Approach]. WNT, Warszawa 1996.