Coordinator Synthesis for Hierarchical Structure of Artificial Neural Network

Stanislaw Placzek
Engineering Department, Vistula University, Stoklosy Street, 02-787 Warsaw, Poland
stanislaw.placzek@wp.pl

Abstract. Two important issues have to be dealt with when implementing the hierarchical structure [1] of the learning algorithm of an Artificial Neural Network (ANN). The first one concerns the selection of the general coordination principle. Three different principles are described; they vary with regard to the degree of freedom given to the first-level tasks. The second issue concerns the coordinator structure, or coordination algorithm. The ANN learning process can be examined as a two-level optimization problem. Importantly, all problems and sub-problems are unconstrained minimization tasks. The article concentrates on the issue of the coordinator structure. Using the interaction prediction principle as the most suitable principle for two-level ANN structures, different coordinator target functions are defined. Using classification task examples, the main dynamic characteristics of the learning process quality are shown and analyzed.

Keywords: Artificial Neural Network (ANN), hierarchy, decomposition, coordination, coordination principle, coordinator structure

1 Computational task complexity

Large-scale multidimensional classification, interpolation and extrapolation are complex tasks that can require long calculation times. For these tasks one can make use of the ANN learning process using input and output data vectors with a defined network architecture. There is no theoretical solution to the architecture selection process, including the definition of the number of hidden layers and the distribution of neurons between layers. From a calculation point of view, the ANN learning process approaches a local or a global minimum asymptotically and is very time-consuming.

A multi-layered ANN with one input layer, a set of hidden layers and an output layer can be sectioned off [1]. Every layer has its own input and output vectors. For a standard two-layer network, both the hidden layer and the output layer can be described as sub-networks. These sub-networks form the first level of the hierarchical structure. The network thus consists of two sub-networks, and a local target function is defined for each of them, Φ = (Φ1, Φ2). Similarly to the ANN structure decomposition, the learning algorithm based on error backpropagation can also be decomposed into:

– the first-level task, searching for the minimum of the local target functions Φ = (Φ1, Φ2),
– the second-level task, coordinating the first-level tasks.

Unfortunately, the first-level optimization tasks in such a learning algorithm are non-linear. In practice, only standard procedures exist to solve these optimization problems. In a two-level learning algorithm structure, however, the coordinator is not responsible for solving the global task. In [2], the most popular interaction prediction principle was implemented. According to this principle, the coordinator plays an active role in the current ANN learning process. The main interaction prediction principle is shown in Fig. 1.

Fig. 1. Interaction Prediction Principle

With each iteration, the coordinator and all of the first-level sub-tasks interchange information. The first-level sub-tasks are optimum searching tasks; usually, they look for the minimum of the Mean Squared Error (MSE).
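This information exchange can be summarized as a simple iteration scheme. The following Python sketch is purely schematic: local_step_1, local_step_2 and coordinator_step are hypothetical placeholders standing in for the first-level optimization procedures and the coordinator update rule that are defined in the following sections.

    def coordinate(u_gamma, local_step_1, local_step_2, coordinator_step,
                   n_iterations=100):
        """Iterate the exchange between the coordinator and the two sub-tasks.

        u_gamma          -- initial prediction of the interface (coordination) signal
        local_step_1/2   -- callables: given u_gamma, improve the sub-network
                            weights and return the feedback signal (eps1, eps2)
        coordinator_step -- callable: given u_gamma and both feedback signals,
                            return an improved coordination signal
        """
        for _ in range(n_iterations):
            eps1 = local_step_1(u_gamma)   # first-level task 1: minimize Phi1
            eps2 = local_step_2(u_gamma)   # first-level task 2: minimize Phi2
            u_gamma = coordinator_step(u_gamma, eps1, eps2)  # second-level task
        return u_gamma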
The coordination algorithm structure is primarily related to the interaction prediction principle, and two signals are used: the primary (descending) coordination signal U^γ = (U1^γ, U2^γ) and the ascending feedback signal ε = (ε1, ε2). The primary signals are known as coordination signals and are sent from the coordinator to all of the first-level sub-tasks. Thereby the coordinator assists in optimizing the first-level sub-tasks. The forecast coordination signals may not be precisely correct, and the sub-tasks calculate their own values of the coordination signals. These signals are sent up to the coordinator. The coordinator then uses its own gradient method to calculate the new value of the coordination signals (improving upon its own previous estimate). The process is continued until the coordinator and all the first-level sub-tasks have solved their own tasks.

1.1 Coordination aspects

Coordination as an operation process is related to three types of decision problems [3]. A solution has to be found for:

– the global task, as the primary task for the entire ANN structure,
– the local tasks, resulting from the decomposition of the global target function,
– the coordinator task, for which a structure or a solution procedure should be synthesized.

In effect, three different specific problems have to be solved:

1. Coordinator Synthesis. Decompose the global target function Φ into two sub-tasks with their own local target functions Φ1, Φ2; the upper-level coordinator task should be defined in such a way that all the sub-tasks are coordinated.

2. Modification Problem. In a two-level ANN learning algorithm, both the first-level tasks and the coordinator task are defined. Unfortunately, the ANN is not coordinated by the coordinator task. Before this problem can be dealt with, one should modify the first-level tasks in such a way that the coordinator coordinates the modified local target functions Φ = (Φ1, Φ2). This can be formalized using the following predicate formula:

(\exists U^{\gamma})(\exists W)\,[\,P(W, \Phi) \;\mathrm{and}\; P(U^{\gamma}, \Psi(U^{\gamma}, \varepsilon))\,] \qquad (1)

Where: the predicate P(W, Φ) is true if Φ is a problem (task) and W is one of its solutions; Ψ(U^γ, ε) is the coordinator target function. The first-level tasks are coordinated with the coordination task when, whenever the coordination task has a solution, all the first-level sub-tasks also have a solution for the same coordination input U^γ.

3. Decomposition. Given only the global target task Φ for the ANN, decompose this task Φ into the two sub-tasks Φ1, Φ2 and find a coordinator structure and a coordination procedure. This is formalized as:

(\exists U^{\gamma})(\exists W)\,[\,P(W = (W1, W2), (\Phi_1(U^{\gamma}), \Phi_2(U^{\gamma}))) \;\mathrm{and}\; P(W, \Psi(X, Z, W))\,] \qquad (2)

Where: X - the input data set, Z - the learning (teaching) data set, W = (W1, W2) - the ANN's weight coefficients. The first-level tasks are coordinated with a given global target task when the global task has a solution and so do the first-level tasks. The coordinator has to influence the first-level tasks in such a way that the resulting action guarantees the solution of the global target task.

1.2 Decomposition and coordination

Using Fig. 1, one can define the set of target functions:

– The global target function:

\Phi(W1, W2, Y, Z) = \frac{1}{2}\sum_{k=1}^{N_2}(y_k - z_k)^2 \qquad (3)

Where: Y[1:N2] - the ANN output vector, Z[1:N2] - the vector of teaching data, N2 - the number of output neurons.

– The local target function Φ1:

\Phi_1(W1, X, U^{\gamma}) = \frac{1}{2}\sum_{i=1}^{N_1}\left(f\Big(\sum_{j=0}^{N_0} W1_{ij}\, x_j\Big) - u_i^{\gamma}\right)^2 \qquad (4)

Where: U^γ[1:N1] - the coordination signal treated as an input variable, N1 - the number of hidden neurons, N0 - the number of input neurons, f(·) - a sigmoid function.

– The local target function Φ2:

\Phi_2(W2, Z, U^{\gamma}) = \frac{1}{2}\sum_{k=1}^{N_2}\left(f\Big(\sum_{i=0}^{N_1} W2_{ki}\, u_i^{\gamma}\Big) - z_k\right)^2 \qquad (5)

Where: U^γ[1:N1] - the coordination signal treated as an input variable, N2 - the number of output neurons, f(·) - a sigmoid function.
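For concreteness, the target functions (3)-(5) can be written down directly. The following NumPy sketch is illustrative only; the bias terms (index 0 in the sums) are omitted for brevity, and the function and variable names are chosen here, not taken from the original implementation.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    # W1: (N1, N0) hidden-layer weights, W2: (N2, N1) output-layer weights,
    # x: input vector, z: teaching vector, u_gamma: coordination signal (length N1).

    def phi_global(W1, W2, x, z):
        """Global target function (3): MSE of the whole two-layer network."""
        y = sigmoid(W2 @ sigmoid(W1 @ x))
        return 0.5 * np.sum((y - z) ** 2)

    def phi1(W1, x, u_gamma):
        """Local target function (4): hidden sub-network output vs. coordination signal."""
        v1 = sigmoid(W1 @ x)
        return 0.5 * np.sum((v1 - u_gamma) ** 2)

    def phi2(W2, z, u_gamma):
        """Local target function (5): output sub-network driven by the coordination signal."""
        v2 = sigmoid(W2 @ u_gamma)
        return 0.5 * np.sum((v2 - z) ** 2)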
Using (4), one can calculate the feedback signal ε1i and the new value of matrix W1:

\varepsilon_{1i} = f\Big(\sum_{j=0}^{N_0} W1_{ij}\, x_j\Big) \qquad (6)

\frac{\partial \Phi_1}{\partial W1_{ij}} = (v1_i - u_i^{\gamma}) \cdot f' \cdot x_j \qquad (7)

W1_{ij}(n+1) = W1_{ij}(n) - \alpha_1 \cdot \frac{\partial \Phi_1}{\partial W1_{ij}} \qquad (8)

For the second sub-network, using (5), one can calculate the new value of ε2i and the new value of matrix W2ki:

\frac{\partial \Phi_2}{\partial W2_{ki}} = (v2_k - z_k) \cdot f' \cdot u_i^{\gamma} \qquad (9)

W2_{ki}(n+1) = W2_{ki}(n) - \alpha_2 \cdot \frac{\partial \Phi_2}{\partial W2_{ki}} \qquad (10)

\frac{\partial \Phi_2}{\partial u_i^{\gamma}} = \sum_{k=1}^{N_2} (v2_k - z_k) \cdot f' \cdot W2_{ki} \qquad (11)

\varepsilon_{2i}(n+1) = u_i^{\gamma}(n) - \alpha_3 \cdot \frac{\partial \Phi_2}{\partial u_i^{\gamma}} \qquad (12)

Where: α1, α2, α3 - learning coefficients, u_i^γ = u1_i^γ = u2_i^γ - coordination signals, ε1i, ε2i - feedback signals.

To summarize, the first level includes two sub-networks and two optimization tasks. The first sub-network takes the parameter u1_i^γ from the coordinator and, using its optimization procedure, calculates the new coefficient matrix W1_{ij}(n+1) and the value of the feedback signal ε1i. The feedback signal is sent to the coordinator. For the second sub-network, the coordination signal u2_i^γ sets the optimization procedure in motion, and the new coefficient matrix W2_{ki}(n+1) and the feedback signal ε2i are calculated. This feedback signal is also sent to the coordinator. After that, the coordinator procedure has to calculate the new coordination signals u1_i^γ(n+1) and u2_i^γ(n+1). Thus, the design of the coordinator structure is the main problem in a two-level learning algorithm.
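A single first-level iteration, following equations (6)-(12), may be sketched as below. This is an illustrative fragment (names chosen here, bias terms omitted, no loop over the learning set), not the original implementation.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def first_level_step(W1, W2, x, z, u_gamma, alpha1=0.3, alpha2=0.3, alpha3=0.3):
        """One gradient step of both first-level sub-tasks, eqs. (6)-(12)."""
        # First sub-network: hidden output and feedback signal eps1, eq. (6)
        v1 = sigmoid(W1 @ x)
        eps1 = v1
        # Gradient of Phi1 w.r.t. W1, eq. (7), with f' = f(1 - f) for the sigmoid,
        # followed by the weight update, eq. (8)
        delta1 = (v1 - u_gamma) * v1 * (1.0 - v1)
        W1 = W1 - alpha1 * np.outer(delta1, x)

        # Second sub-network, driven by the coordination signal
        v2 = sigmoid(W2 @ u_gamma)
        delta2 = (v2 - z) * v2 * (1.0 - v2)
        # Gradient of Phi2 w.r.t. the coordination signal, eq. (11)
        d_phi2_du = W2.T @ delta2
        # Gradient of Phi2 w.r.t. W2, eq. (9), and the weight update, eq. (10)
        W2 = W2 - alpha2 * np.outer(delta2, u_gamma)
        # Feedback signal of the second sub-network, eq. (12)
        eps2 = u_gamma - alpha3 * d_phi2_du

        return W1, W2, eps1, eps2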
2 Coordinator structure

In a two-level learning algorithm, the coordinator plays the main role. Therefore, the choice of the coordination principle is paramount. The principle should specify the strategy for the coordinator and determine the structure of the coordinator. Three different approaches to how the interaction could be performed were introduced in [3]. For an ANN learning algorithm, the Interaction Prediction Principle is the most suitable. According to it, the coordinator predicts the interface inputs. Success in coordinating all the first-level tasks depends on the accuracy of the prediction of the interface inputs or of the effect of the predicted inputs. Therefore, to obtain responses from the first-level tasks, the coordinator could measure:

1. The interface input values. This principle will be known as the Interface Interaction Balance.
2. The target function values. This principle will be known as the Performance Prediction Principle.

2.1 Interface Interaction Balance

The coordination input may involve a prediction of the interface between the first and the second sub-networks. For the first sub-network, the coordinator signal U1^γ is sent into the target function Φ1 as a teaching data parameter. For the second sub-network, the coordinator signal U2^γ is treated as an input variable. Thanks to this, both tasks are fully specified and the algorithms can find the minimum values of their target functions Φ1, Φ2. The Interface Interaction Balance idea is shown in Fig. 2.

Fig. 2. Coordination algorithm structure for Interface Prediction Principle

The coordinator target function Ψ could be a linear or a nonlinear function:

\Psi(U^{\gamma}, \varepsilon) = \Psi(U^{\gamma}, U^{\gamma} - Y) \qquad (13)

Equation (13) uses the linear relation between the two feedback signals ε1, ε2 and the predicted interface value U^γ.

Where: U^γ - predicted interface value, Y - real interface value.

\Psi = \frac{1}{2}\sum_{i=1}^{N_1}(u_i^{\gamma} - \varepsilon_{1i})^2 + \frac{1}{2}\sum_{i=1}^{N_1}(u_i^{\gamma} - \varepsilon_{2i})^2 \qquad (14)

Where: U^γ = [u_1^γ, u_2^γ, ..., u_{N_1}^γ].

Using formula (14), the first derivative is calculated:

\frac{\partial \Psi}{\partial u_i^{\gamma}} = (u_i^{\gamma} - \varepsilon_{1i}) + (u_i^{\gamma} - \varepsilon_{2i}) \qquad (15)

u_i^{\gamma}(n+1) = u_i^{\gamma}(n) - \lambda_1 \cdot \frac{\partial \Psi}{\partial u_i^{\gamma}} \qquad (16)

Where: λ1 - learning coefficient.

The coordinator and the first-level sub-networks work in an iterative scheme. When the coordination signal U^γ(n) is applied and the first-level optimization tasks find their own solutions, a new coordination signal can be calculated from (16). Under the Interface Interaction Balance, two vector signals are measured: the predicted interface input U^γ and the real interface values Y1, Y2 (Fig. 2). In a real situation this requirement could be difficult to implement. It is usually possible to measure the first-level tasks and to send to the coordinator the target function (performance) values Φ1, Φ2 and their derivatives.

2.2 Linear Performance Balance Principle

The target function values and their derivatives are of a more generalized form. The coordination scheme that uses this idea is shown in Fig. 3.

Fig. 3. Coordination algorithm structure for Performance Prediction Principle

The coordinator can receive both the target function values and the derivative values from the first-level sub-systems. The coordinator function can be defined as a linear or a nonlinear function of the two parameters Φ1, Φ2:

\Psi = \Psi(\Phi_1, \Phi_2) \qquad (17)

Then the local target functions can be defined by:

\Phi_1(Y1, W1, U^{\gamma}) = \frac{1}{2}\sum_{i=1}^{N_1}(y1_i - u_i^{\gamma})^2 \qquad (18)

\Phi_2(Y2, W2, Z) = \frac{1}{2}\sum_{k=1}^{N_2}(y2_k - z_k)^2 \qquad (19)

In this subsection, we examine the coordinator function (17) as a linear relation:

\Psi = \Phi_1 + \Phi_2 \qquad (20)

Using (18) and (19), the first partial derivatives are calculated:

\frac{\partial \Phi_1}{\partial u_i^{\gamma}} = (y1_i - u_i^{\gamma}) \qquad (21)

\frac{\partial \Phi_2}{\partial u_i^{\gamma}} = \sum_{k=1}^{N_2}(y2_k - z_k) \cdot f' \cdot W2_{ki} \qquad (22)

Using the same gradient algorithm for the coordinator as above, the new coordination signal U^γ(n+1) is calculated from the formula:

u_i^{\gamma}(n+1) = u_i^{\gamma}(n) - \lambda_1 \cdot \frac{\partial \Psi}{\partial u_i^{\gamma}} \qquad (23)

The global target function Ψ is a nonlinear function that should take into account the sigmoid activation functions of both the first and the second sub-network. Because of this, one can say that the coordination function described by formula (20) is simplistic and does not consider the non-linearity of the coordinator structure.

2.3 Nonlinear Performance Balance Principle

In this article, non-linearity will be introduced through two coordinator target functions:

\Psi_m = \Phi_1 + \Phi_2 + c \cdot \Phi_1 \cdot \Phi_2 \qquad (24)

\Psi_p = \Phi_1 + \Phi_2 + c \cdot (\Phi_1^2 + \Phi_2^2) \qquad (25)

Where: Ψm - coordinator function with a nonlinear multiplicative part, Ψp - coordinator function with a nonlinear sum-of-squares part. The parameter c describes the impact of the nonlinear part on the coordinator algorithm. The target functions Φ1, Φ2 have the same structure as in formulas (18) and (19), respectively. The final formulas for the derivatives have a more complicated structure:

\frac{\partial \Psi_m}{\partial u_i^{\gamma}} = \frac{\partial \Phi_1}{\partial u_i^{\gamma}} + \frac{\partial \Phi_2}{\partial u_i^{\gamma}} + c \cdot \left( \frac{\partial \Phi_1}{\partial u_i^{\gamma}} \cdot \Phi_2 + \frac{\partial \Phi_2}{\partial u_i^{\gamma}} \cdot \Phi_1 \right) \qquad (26)

\frac{\partial \Psi_p}{\partial u_i^{\gamma}} = \frac{\partial \Phi_1}{\partial u_i^{\gamma}} + \frac{\partial \Phi_2}{\partial u_i^{\gamma}} + 2c \cdot \left( \frac{\partial \Phi_1}{\partial u_i^{\gamma}} \cdot \Phi_1 + \frac{\partial \Phi_2}{\partial u_i^{\gamma}} \cdot \Phi_2 \right) \qquad (27)

In practice, the developer has to find the optimal value of c. This parameter has a significant impact on the quality and stability of the coordinator learning process.
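Assuming the first-level tasks report the values Φ1, Φ2 and their derivatives with respect to u^γ, the linear and nonlinear coordinator updates (20)-(27), together with (23), reduce to a few lines. The sketch below is illustrative only: the derivative of Φ1 is written directly from definition (18), bias terms are omitted, and the names are chosen here, not taken from the original implementation.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    def coordinator_step(u_gamma, W1, W2, x, z, lam=0.2, c=0.0, mode="linear"):
        """One coordinator update for the Performance Balance Principle.

        mode = "linear" -> Psi   = Phi1 + Phi2                        (20)
        mode = "mult"   -> Psi_m = Phi1 + Phi2 + c*Phi1*Phi2          (24), (26)
        mode = "power"  -> Psi_p = Phi1 + Phi2 + c*(Phi1^2 + Phi2^2)  (25), (27)
        """
        y1 = sigmoid(W1 @ x)        # first sub-network output
        y2 = sigmoid(W2 @ u_gamma)  # second sub-network output

        phi1 = 0.5 * np.sum((y1 - u_gamma) ** 2)         # (18)
        phi2 = 0.5 * np.sum((y2 - z) ** 2)               # (19)
        d_phi1 = u_gamma - y1                            # derivative of (18) w.r.t. u
        d_phi2 = W2.T @ ((y2 - z) * y2 * (1.0 - y2))     # (22), with f' = f(1 - f)

        if mode == "linear":
            d_psi = d_phi1 + d_phi2                                            # (20)
        elif mode == "mult":
            d_psi = d_phi1 + d_phi2 + c * (d_phi1 * phi2 + d_phi2 * phi1)      # (26)
        else:  # "power"
            d_psi = d_phi1 + d_phi2 + 2 * c * (d_phi1 * phi1 + d_phi2 * phi2)  # (27)

        return u_gamma - lam * d_psi                     # (23)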
3 Classification task example

The main dynamic characteristics of the learning process can be shown using the following example. Emphasis is put on the characteristics of the first-level local target functions Φ1, Φ2 and of the second-level coordinator target function Ψ.

Optimal ANN learning characteristics depend on two connected tasks. Firstly, a network structure with a number of hidden layers needs to be chosen. Secondly, neurons need to be distributed between the layers. A single hidden layer is chosen in this example. The structure of the ANN is simple and can be described as ANN(N0 - N1 - N2). In the literature, one can only find suggestions regarding the optimal number of neurons, based on the Vapnik-Chervonenkis dimension [4][5]. However, the hidden layer structure could also be determined using Kolmogorov's theorem [4][6]. For an ANN with one hidden layer, a sigmoid activation function chosen for classification tasks and N0 input neurons,

VC_{dim} = N_0 + 1 \qquad (28)

Therefore, this measure can be used to define the number of neurons in the hidden layer:

N_1 = VC_{dim} \qquad (29)

For a continuous function with an N0-dimensional input vector and an N2-dimensional output vector, the number of hidden neurons according to Kolmogorov's theorem can be calculated as

N_1 = 2 \cdot N_0 + 1 \qquad (30)

In practice, the number of neurons in the hidden layer is chosen according to the formula

N_0 + 1 \leq N_1 \leq 2 \cdot N_0 + 1 \qquad (31)

Sigmoid activation functions are implemented in both the hidden and output layers. For the classification task using 6-dimensional input vectors, N0 = 6 and N2 = 1, the number of hidden neurons is calculated according to the formula

7 \leq N_1 \leq 13 \qquad (32)

3.1 Example for Interface Interaction Balance

The quality of the learning process depends on the key learning parameters at the first level as well as at the coordinator level. Fig. 4 shows the dynamic characteristics of the learning process.

Fig. 4. The first sub-network's learning error, λ1 = 0.5
Fig. 5. The coordinator learning error, λ = 0.2

The main learning parameters, α1 = 0.3 and α2 = 0.3, guarantee that every sub-task can find the minimum value of its own target function. The coordinator has its own algorithm, described by equation (23). If the parameter λ1 is too large, the learning process is not stable, especially at the end of the iterative process. The first sub-network includes 6 input neurons and 13 output neurons. Matrix W1 includes 7*13 = 91 weight coefficients, but matrix W2 only 14*1 = 14. At each stage the first sub-network achieves a lower target function value Φ1 than the second sub-network. In the middle and the final part of the iterative process, the characteristics are the same. This means that the sub-networks achieve only small corrections to their weight coefficients and the standard gradient algorithm is not efficient (Fig. 6).

Fig. 6. Learning error with regard to the iteration number, learning parameter λ1 = 0.2

3.2 Linear Performance Balance Principle

As stated above, the performance prediction principle is more general than interface prediction. The coordinator does not receive all of the elements of the output vectors Y1, Y2, as in the previous algorithm, but only more general information, such as the function values Φ1, Φ2 and their derivatives. In Fig. 6 the quality of the learning process is shown. The error functions decrease quite fast and no oscillation occurs. In Fig. 7 the dynamic characteristics of the interface values u1, u5, u11 are shown; some of them change their values dynamically.

Fig. 7. Dynamic characteristics of the hidden interface values
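Before turning to the nonlinear case, note that the hidden-layer bounds used in this example follow directly from formula (31). A minimal helper of the following kind (the function name is ours, a sketch only) reproduces the range 7 ≤ N1 ≤ 13 obtained above for N0 = 6.

    def hidden_layer_range(n_inputs):
        """Admissible number of hidden neurons per formula (31):
        N0 + 1 <= N1 <= 2*N0 + 1, i.e. the lower bound from the VC dimension
        (28)-(29) and the upper bound from Kolmogorov's theorem (30)."""
        return n_inputs + 1, 2 * n_inputs + 1

    # For the 6-dimensional classification task above:
    low, high = hidden_layer_range(6)
    print(low, high)  # 7 13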
3.3 Nonlinear Performance Balance Principle

The sub-network transfer function is nonlinear. The coordinator structure should take this into account and use a more complicated target function structure. According to formulas (24)-(27), the function Ψ includes a nonlinear part: the product of the target functions or the sum of their squares. In Fig. 8 the quality of the learning process is shown. A flex point in all of the sub-network characteristics is typical of this learning process; after it, the learning process approaches the minimum value asymptotically. The learning process quality is different for λ1 = 0.2: it is stable and smooth (Fig. 9). The coordinator algorithm requires two parameters: the learning coefficient λ1 and the parameter c. The learning process quality depends on the value of c. Fig. 10 shows this characteristic, including the iteration number at the flex point and the total iteration number.

Fig. 8. The iteration number for the nonlinear coordination function, λ1 = 0.3
Fig. 9. The iteration number for the nonlinear coordination function, λ1 = 0.2

4 Conclusion

In this article, only a single principle was examined. The interaction balance principle can be realized by measuring two different feedback signals: the sub-network output signals Y1, Y2 or the sub-network performance values Φ1, Φ2. In the first case, the learning process is very sensitive to the coordinator learning parameter λ. For λ = 0.3, the last learning stage includes oscillations, which delay convergence. For λ = 0.2 the characteristics are better. Learning processes are not simple and depend not only on the ANN structure but on the input data structure as well. When sub-network performance is measured, the coordinator receives more general information. This has a positive impact on all the dynamic characteristics (Fig. 6); in particular, the oscillations seen in the middle of the learning process are smaller. From the coordinator's point of view, every sub-network can be seen as a black box with the input signal U^γ and output Y1 or Y2. The response functions have to be nonlinear because the global target function is also nonlinear. A coordinator using a stable coordination algorithm with constant learning parameters cannot ensure the optimal interaction vector U^γ during the entire learning process (from beginning to end), and this has a negative impact on the quality and speed of convergence. Using the hierarchical principle, an additional level can be added: the adaptation level. The adaptation level, using its own identification algorithm, can predict learning parameters such as λ and c. This theme should be studied in the future. Finally, the nonlinear coordinator algorithm was examined. In that case all algorithms are very sensitive to the parameter c, which determines the impact of the nonlinear part of the coordinator algorithm on convergence. For a small value of c, the learning time is very long, but for a large value, oscillations appear. The characteristic n = f(c), where n is the iteration number, is shown in Fig. 10.

Fig. 10. Optimal parameters for the Nonlinear Performance Balance Principle

References

1. Placzek S.: A two-level on-line learning algorithm of Artificial Neural Network with forward connections. IJARAI, vol. 3, no. 12, 2014.
2. Placzek S.: Decomposition and the principle of interaction prediction in hierarchical structure of learning algorithm of ANN. Poznan University of Technology Academic Journals, Electrical Engineering, no. 84, Poznan 2015.
3. Mesarovic M.D., Macko D., Takahara Y.: Theory of Hierarchical, Multilevel Systems. Academic Press, New York and London 1970.
4. Haykin S.: Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company, New York 1994.
5. Vapnik V.: Statistical Learning Theory. Wiley, New York 1998.
6. Osowski S.: Sieci neuronowe w ujeciu algorytmicznym [Neural Networks: An Algorithmic Approach]. WNT, Warszawa 1996.