<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Coordination algorithm in hierarchical structure of the learning process of Artificial Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stanislaw Placzek</string-name>
          <email>stanislaw.placzek@wp.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bijaya Adhikari</string-name>
          <email>bijaya.adhikari1991@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vistula University</institution>
          ,
          <addr-line>Warsaw</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>While analyzing Artificial Neural Network structures, one usually finds that the first parameter is the number of ANN layers. A hierarchical structure is the accepted default way to define an ANN structure. This structure can be described using different methods, mathematical tools, and software and/or hardware realizations. In this article, we propose an ANN decomposition into hidden and output sub-networks. To build this kind of learning algorithm, information is exchanged between the first, sub-network level and the second, coordinator level in every iteration. The learning coefficients are tuned in every iteration. The main coordination task is to choose the coordination parameters in order to minimize both the global target function and all local target functions. In each iteration, their values should decrease in an asymptotic way towards the minimum. In this article, a learning algorithm using the forecast of the connections between sub-networks is studied.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Many ANN structures are used in practice. The most popular among them is the
one with Forward Connections, having a complete or semi-complete set of weight
coefficients. For special needs, ANNs with Forward Cross Connections and
Back Connections are used. The full structure of an ANN is depicted in Fig. 1. To
describe the structure, independent of the ANN complexity, a partition into layers
is used: the input layer, one or more hidden layers, and the output layer. The input
layer connects the ANN with the external world (environment) and performs initial
processing, calibration or filtering of the input data. The hidden layers are used
for the main data processing.</p>
      <p>In the most common structures, the hidden layers include more neurons than the
input layer, and they use non-linear activation functions. The output layer, which
sums all signals from the hidden layers, uses two types of activation functions: a
linear activation function for approximation tasks and non-linear sigmoid or tanh
activation functions for classification tasks. In this paper, to avoid confusion
regarding the number of layers, only the hidden layers and the output layer are
counted. The concept of layers in ANN structures reflects the tacit assumption
that ANN structures are hierarchical. Taking this into account as a very
important feature of ANNs, a couple of concepts can be used to describe the
network characteristics.
1.1
To analyze ANN structure, verbal description is used so as to help everybody
understand how ANN is built. For more detailed analysis, mathematical
description using algebra and/or di erential equations is required. Based on these
descriptions, ANNs are then implemented by a computer program or an
electronic device. So, to achieve complete description of ANN, concepts and models
from di erent elds of science and technology have to be used.</p>
      <p>Every model uses its own set of variables and terminology at a different
level of abstraction. To describe and understand how a particular ANN works,
a hierarchical set of abstract concepts is used. To separate these concepts
from the layer description, a new name is used [15]: delamination of the ANN into
abstract strata.
1.2 Calculation complexity or decision taking</p>
      <p>For a multi-layered ANN, a number of hidden layers and the output layer can be sectioned
off. Every layer has its own output vector, which is the input vector of the next
layer: Vi, i = 1, 2, ..., n. Both the hidden layers and the output layer can be described
as sub-networks; "n" defines the total number of sub-networks. The logical
decomposition of the ANN relies on separating the layers by establishing the extra output
vectors Vi, i = 1, 2, ..., n. Now the network consists of a set of sub-networks,
for each of which a local target function Φi is defined, with Φ = (Φ1, Φ2, ..., Φn).</p>
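      <p>As a minimal illustration of this logical decomposition (a sketch only; the
function names and the NumPy representation are our own assumptions, not taken from
the paper), each sub-network can be modelled as a callable that maps its input to
its own output vector Vi:</p>
      <preformat>
from typing import Callable, List
import numpy as np

# Each sub-network maps the previous output vector to its own output V_i;
# chaining the sub-networks reproduces the layered forward pass of the full ANN.
SubNet = Callable[[np.ndarray], np.ndarray]

def forward(subnets: List[SubNet], x: np.ndarray) -> List[np.ndarray]:
    outputs = []
    for net in subnets:
        x = net(x)          # V_i becomes the input of sub-network i+1
        outputs.append(x)
    return outputs          # [V_1, V_2, ..., V_n]
      </preformat>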
      <p>Similar to the ANN structure decomposition, the learning algorithm using error back
propagation can be decomposed too (Fig. 3). We can sort out:
- the first-level tasks, in which the minimum of the local target functions
Φi, i = 1, 2, ..., n, is searched for;</p>
      <p>- the second-level task, which has to coordinate all the first-level tasks.</p>
      <p>In a learning algorithm constructed this way, there is a set of optimization
tasks on the first level. These tasks search for the minimum values of their
target functions. Unfortunately, these are non-linear tasks without constraints.
In practice, standard procedures to solve these problems exist. But in the two-level
learning algorithm structure, the coordinator is not responsible for solving the global
task. The coordinator is obliged to calculate the values of the coordination parameters
γ = (γ1, γ2, ..., γn) for every task on the first level. The first level, searching
for the solution of all tasks, has to use the coordination parameter values. It
is an iterative process. In every iteration cycle, the coordinator receives new values
of the feedback parameters β = (β1, β2, ..., βn) from the first-level tasks. Using this
information, the coordinator has to make new decisions: calculate the new
coordination parameter values. These procedures can be relatively complicated,
and in most situations they happen to be non-gradient procedures. In the
hierarchical learning algorithm, the target functions can be defined as:
the global target function Φ,</p>
      <p>the set of local target functions Φi, where i = 1, 2, ..., n, one for every sub-network,
and the coordinator target function Ψ.</p>
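      <p>The iteration scheme can be summarized in the following skeleton (a sketch
under assumed names: "tasks" stands for the first-level procedures and
"coordinator" for the second-level rule; neither comes from the paper):</p>
      <preformat>
# Two-level iteration: the coordinator sends the coordination parameters gamma
# down to the first-level tasks and receives the feedback parameters beta back.
def two_level_learning(tasks, coordinator, gamma, n_iter):
    for n in range(n_iter):
        # first level: every task i performs one minimization step of Phi_i
        # given gamma[i] and reports its feedback beta[i]
        beta = [task(g) for task, g in zip(tasks, gamma)]
        # second level: a (generally non-gradient) update of gamma from beta
        gamma = coordinator(beta, gamma)
    return gamma
      </preformat>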
      <p>According to [15][2], the solution of the primary task depends on the minimum of the
global target function. The first-level tasks should be built in such a way that,
when all the first-level tasks are solved, the final solution is achieved:
the minimum of the global target function. This kind of stratified structure is
known as a level hierarchy [15].</p>
      <p>To summarize, we conclude:</p>
      <p>The complexity of the problem increases from the first level to the second.
The coordinator needs more time to solve its own tasks.</p>
      <p>Coordination tasks can be non-parametric procedures. To study the
dynamics of the changing target function values, the coordinator should have the
ability to change the learning parameters in the first-level tasks.
As stressed above, all the first-level tasks are non-linear and have to be
solved using iterative procedures.</p>
      <p>For different tasks, the characteristics of the ANN learning process can
differ. The coordinator, studying the feedback information from the first-level
tasks, should have the ability to change all parameters in both the
coordinator and the first-level procedures.</p>
      <p>2 Decomposition and coordination of the ANN learning algorithm</p>
      <p>The two-layered ANN, with one hidden layer and an output layer using full internal
forward connections, has no Cross-Forward or Back Connections. This
kind of network can be used for both approximation and classification tasks.
According to the concept introduced above, this ANN can be described using two
strata.</p>
      <p>2.1 Verbal description of the structure. Stratum 2.</p>
      <p>An ANN with full forward connections contains one hidden layer. In this layer,
the connections between the input vector X and the output vector V1 are represented by
the matrix W1. All matrix coefficients are defined. The connections in the output
layer are defined by the matrix W2, which connects the input vector V1 and the output
vector Y. In this matrix, all weight coefficients are defined, too. The number of input
neurons is defined by the vector X, which has dimensionality N0. In the same
way, the number of neurons in the output layer is defined by the vector Y, which has
dimensionality N2. The number of neurons in the hidden layer, N1, depends on the
complexity of the problem. Usually N1 &gt; N0, so data is not compressed in the first
layer. Based on the description introduced above, the ANN can be set up as a
hierarchical level structure (Fig. 4). On the first level, two local target functions,
Φ1 for the first sub-network and Φ2 for the second sub-network, are defined. On
the second level, the coordinator is established. Its main goal is to coordinate all the
first-level tasks and to achieve the minimum of the global target function. For the
coordinator, two functions G and H are defined, which transform the coordination
signals (V21, V12) and the feedback signals (V1, V2). At the same time, the coordinator
should have the ability to change the values of the learning coefficients α1 and α2 by using
the transformation functions h1(Φ1, Φ2) and h2(Φ1, Φ2) (Fig. 4).
In the decomposed ANN structure, we can define the following target functions.
The global target function, for the whole epoch:</p>
      <p>\Phi(W1, W2, X, Y) = \sum_{k=1}^{N_2} \sum_{p=1}^{N_p} \Phi_{2pk} = \sum_{k=1}^{N_2} \sum_{p=1}^{N_p} (v_{2pk} - z_{kp})^2 \quad (6)</p>
      <p>v_{2pk} = f(e_{pk}) \quad (7)</p>
      <p>e_{pk} = \sum_{i=0}^{N_1} W2_{ki} \, v12_{ip} \quad (8)</p>
      <p>where f is the sigmoid function, i = 1, 2, ..., N1, k = 1, 2, ..., N2, and
Φ2pk is the local target function for the k-th output of the second sub-network and
the p-th element of the training set.</p>
      <p>On the first level, two minimization tasks, for Φ1 and Φ2, have to be solved. These
target functions have additive structures: they can be divided into N1 and
N2 sub-tasks respectively. This can be used to build programming procedures
in an appropriate programming language. So, we can formulate N1 sub-tasks:</p>
      <p>\min \Phi_1 = \min \sum_{i=1}^{N_1} \Phi_{1i} \quad (9)</p>
      <p>\Phi_{1i} = \sum_{p=1}^{N_p} \left( f\left[ \sum_{j=0}^{N_0} W1_{ij} \, x_{jp} \right] - v21_{ip} \right)^2 \quad (10)</p>
      <p>W1_{ij}(n+1) = W1_{ij}(n) - \alpha_1 \frac{\partial \Phi_{1i}}{\partial W1_{ij}}, \qquad \frac{\partial \Phi_{1i}}{\partial W1_{ij}} = \sum_{p=1}^{N_p} (v1_{ip} - v21_{ip}) \, f'(e_{1ip}) \, x_{jp} \quad (11)</p>
      <p>for i = 1, 2, ..., N1 and j = 0, 1, 2, ..., N0, where Φ1i is the local target function
for the i-th output of the first sub-network and for the whole
training set.</p>
      <p>In the same way, N2 sub-tasks can be formulated:</p>
      <p>\min \Phi_2 = \min \sum_{k=1}^{N_2} \Phi_{2k} \quad (12)</p>
      <p>\Phi_{2k} = \sum_{p=1}^{N_p} \left( f\left[ \sum_{i=0}^{N_1} W2_{ki} \, v12_{ip} \right] - z_{kp} \right)^2 \quad (13)</p>
      <p>W2_{ki}(n+1) = W2_{ki}(n) - \alpha_2 \frac{\partial \Phi_{2k}}{\partial W2_{ki}}, \qquad \frac{\partial \Phi_{2k}}{\partial W2_{ki}} = \sum_{p=1}^{N_p} (v2_{pk} - z_{kp}) \, f'(e_{pk}) \, v12_{ip} \quad (14)</p>
      <p>for k = 1, 2, ..., N2 and i = 0, 1, 2, ..., N1, where Φ2k is the local target function
for the k-th output of the second sub-network and for the whole training set.</p>
      <p>The coordinator target function:</p>
      <p>\Psi = \Psi(\Phi_1, \Phi_2, V1, V2) \quad (15)</p>
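      <p>A compact NumPy sketch may make equations (6)-(15) concrete. The weight
updates follow (11) and (14); the coordination rule in the last lines (moving the
forecast V21 towards the real hidden output V1 with the rate β2, and feeding
V12 = V21 to the second sub-network) is an assumption based on the forecast
principle discussed below, not a formula given in the paper:</p>
      <preformat>
import numpy as np

def sigmoid(e):
    return 1.0 / (1.0 + np.exp(-e))

# One epoch of the decomposed learning algorithm.
# Shapes: X is (N0+1, Np) with a bias row, Z is (N2, Np),
# W1 is (N1, N0+1), W2 is (N2, N1+1), V21 is (N1, Np).
def epoch_step(W1, W2, X, Z, V21, alpha1, alpha2, beta2):
    Np = X.shape[1]
    # first sub-network: Phi1 = sum_i sum_p (v1_ip - v21_ip)^2, eqs (9)-(11)
    V1 = sigmoid(W1 @ X)
    Phi1 = np.sum((V1 - V21) ** 2)
    d1 = (V1 - V21) * V1 * (1.0 - V1)          # error times sigmoid derivative
    W1 = W1 - alpha1 * (d1 @ X.T)              # update (11)
    # second sub-network: Phi2 = sum_k sum_p (v2_pk - z_kp)^2, eqs (12)-(14)
    V12 = np.vstack([np.ones((1, Np)), V21])   # coordination signal plus bias
    V2 = sigmoid(W2 @ V12)
    Phi2 = np.sum((V2 - Z) ** 2)
    d2 = (V2 - Z) * V2 * (1.0 - V2)
    W2 = W2 - alpha2 * (d2 @ V12.T)            # update (14)
    # coordinator (assumed rule): move the forecast V21 towards the real V1
    V21 = V21 + beta2 * (V1 - V21)
    return W1, W2, V21, Phi1, Phi2
      </preformat>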
      <p>The first-level tasks calculate control parameters and send them to the
coordinator. Additionally, in every iteration, the coordinator analyzes the local target
functions Φ1i(n) and Φ2k(n). This information is necessary to calculate the
new vector value V21. At the same time, the coordinator should have the ability
to interfere in the learning process by selecting new values of the learning parameters
such as α1, α2 and β2. The coordinator can calculate the value of the target function by
itself, using the data sent to it by the first level. We should stress that the values of the
target functions change dramatically during the learning process. We observed that the
values of Φ1i(n) and Φ2k(n) changed significantly over several hundred iterations. At the
same time, during the learning process the values of the target functions can increase
to a large value and then decrease drastically. This behaviour shows that the ANN,
at the beginning of the learning process, has to attune the weight coefficients of
the W1 matrix. In the next step, both the Φ1 and Φ2 target functions change their
values in an asymptotic way towards their minimum. This means that the weight
coefficients of both the W1 and W2 matrices are near their stable values and only
small corrections are applied. So, the coordinator should study not only the target
functions but also their dynamic changing process.</p>
      <p>3 Example</p>
      <p>In this example, the main dynamic characteristics of the learning process are
shown. The stress is put on the characteristics of the first-level local target
functions Φ1 and Φ2. The structure of the ANN is simple and can be described as
ANN(3-5-1). This means that the ANN includes 3 input neurons, 5 neurons in the
hidden layer and 1 output neuron. Sigmoid activation functions are implemented
in both the hidden and output layers. The three arguments of the XOR function are fed
as input data, so every epoch includes 8 vectors. Changing different learning
parameters, such as α1, α2 and β2, the dynamic characteristics have been studied.</p>
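      <p>For reference, the ANN(3-5-1) example can be reproduced with the earlier sketch
(it reuses sigmoid and epoch_step; the initial weights, the starting forecast
V21 = V1(0) and the rate values below are illustrative assumptions, not the
values used in the experiment):</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)
# three-argument XOR (parity): every epoch includes Np = 8 input vectors
bits = np.array([[b // 4 % 2, b // 2 % 2, b % 2] for b in range(8)])
X = np.vstack([np.ones(8), bits.T])            # (N0+1, Np) with a bias row
Z = (bits.sum(axis=1) % 2).reshape(1, 8)       # (N2, Np) target outputs
N0, N1, N2 = 3, 5, 1
W1 = rng.normal(0.0, 0.5, (N1, N0 + 1))
W2 = rng.normal(0.0, 0.5, (N2, N1 + 1))
V21 = sigmoid(W1 @ X)                          # start the forecast at V1
for n in range(5000):
    W1, W2, V21, Phi1, Phi2 = epoch_step(
        W1, W2, X, Z, V21, alpha1=0.5, alpha2=0.5, beta2=0.3)
      </preformat>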
      <p>In the second part of the test, a simple adaptive coordination algorithm
was used. Fig. 5 shows how the two target functions Φ1 and Φ2 changed their values
during the learning process (over the iterations). The quality of the dynamic processes
differs. The function Φ2 represents the second local target function (the
output one). This process is smooth: during the learning
process, the value of Φ2 decreases at a constant rate towards the minimum value.
Midway through the process, its value decreases very slowly. This is correlated
with the first target function Φ1 (hidden layer), whose behaviour is quite different.
From the start up to 3700 iterations, the target function Φ1 increased its value. Two
local maxima, around 1000 iterations and 3700 iterations, are seen. After that, both Φ1
and Φ2 decrease their values and asymptotically reach the
minimum.</p>
      <p>As we stressed in the previous sections, the hidden layer can be divided into 5
sub-networks. Fig. 6 shows the outputs of three of the sub-networks (Φ11, Φ13, Φ14).
The qualitative dynamic characteristics are the same, but the maxima of the
amplitudes are different.</p>
      <p>In the next figure (Fig. 7), we can see that the quality of the learning process
depends on the β2 parameter. This parameter is calculated by the coordinator and has
an impact on the forecast of the vector V21 value. For β2 = 0.1, which is too small, the
learning process isn't smooth: small oscillations can be seen. But if β2 = 0.5, which is
too big, the amplitude increases its value more than 5 times. So, the coordinator
should calculate β2 using its own adaptive algorithm, which should receive the feedback
from the first level and analyze the target functions Φ1 and Φ2.</p>
      <p>To study the impact of the coordinator on the quality of the learning process, the
adaptive algorithm changes two parameters: α1, the learning rate for the hidden
layer, and β2, the learning rate for the vector V21. The vector V21 forecasts the hidden
layer's output (Fig. 8). When Φ2 is greater than Φ1, the learning rates increase.
Their values were increased in very small steps of only 0.05. The learning rates
α1 and β2 shouldn't be extremely large or small,
so two extra constraints were used. Fig. 5 shows the coordinator's final
impact on the quality of the learning process. The target function Φ2 decreases its
value throughout the learning process, but the target function Φ1 still has the two
maximum values. This problem will be studied in future work.</p>
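      <p>A sketch of this simple adaptive rule follows (the decrease branch and the
clamping bounds are assumptions; the text above only fixes the 0.05 step and the
existence of two constraints that keep the rates from becoming extreme):</p>
      <preformat>
# Raise alpha1 and beta2 by 0.05 when Phi2 exceeds Phi1, otherwise lower them;
# both rates are clamped to an assumed range [lo, hi].
def adapt_rates(alpha1, beta2, Phi1, Phi2, step=0.05, lo=0.05, hi=1.0):
    if Phi2 > Phi1:
        alpha1, beta2 = alpha1 + step, beta2 + step
    else:
        alpha1, beta2 = alpha1 - step, beta2 - step
    clamp = lambda r: min(max(r, lo), hi)      # the two extra constraints
    return clamp(alpha1), clamp(beta2)
      </preformat>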
      <p>Fig. 10 shows how the values of the two learning rates are changed by the
coordinator.
In [15], a few coordination principles are defined for big hierarchical system
structures. In this article, the following principle is used: the forecast of the
connections between sub-networks. In the hierarchical structure of the ANN, the
coordinator should forecast the value of the vector V21. This value should be the
same as the real value of the hidden layer output V1. In this situation, the global
target function should achieve its minimum value, and then the learning process
finishes.</p>
      <p>If the first-level local target functions Φ1 and Φ2 both meet a couple of
conditions [2][15], then convergence is guaranteed. Unfortunately, the global
target function isn't convex and can have a lot of local minima. Therefore,
it is not possible to prove mathematically that the algorithm is stable and
convergent. But the first-level local target functions don't include any constraints,
and that helps while building the learning algorithm. Fig. 10 shows the final result
of the different characteristics of the learning processes.</p>
      <p>In the learning processes shown in Fig. 10, all rates were constant. The coordinator
calculates the new V21 value using β2. Fig. 5 shows that the value of the target
function Φ2 doesn't change between 2000 and 3700 iterations. This
is due to the fact that the ANN first has to stabilize the W1 matrix
weight coefficients. This process depends on the V21 vector. When all the W1
weight coefficients are stable, the matrix W2 then stabilizes its weight coefficients.
In this ANN, the first layer played the most important role. The sub-networks'
impact on the final value of the first layer's target function Φ1 differs:
there are components whose impact is very small. This can be explained
by the hidden layer structure; the hidden layer includes structural neuron
redundancy. Finally, the coordination algorithm is analyzed. The learning rates α1
and β2 didn't achieve their maximum values. Probably the values of the learning
rates should be calculated using not only the relation between Φ1 and Φ2, but
also their dynamic characteristics, such as the first differences</p>
      <p>\Delta\Phi_1(n) = \Phi_1(n) - \Phi_1(n-1) \quad \text{and} \quad \Delta\Phi_2(n) = \Phi_2(n) - \Phi_2(n-1).</p>
      <p>This implies that the coordinator should implement a PID controller algorithm:</p>
      <p>\beta_2(n+1) = \beta_2(n) + \kappa_1 \, \Delta\Phi_1(n) + \kappa_2 \left( \Delta\Phi_1(n) - \Delta\Phi_1(n-1) \right) \quad (16)</p>
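      <p>Equation (16) translates directly into a one-line update (κ1 and κ2 are the
unspecified proportional and derivative gains; the default values below are
placeholders):</p>
      <preformat>
# PID-style coordinator update of beta2, eq. (16):
# beta2(n+1) = beta2(n) + kappa1*dPhi1(n) + kappa2*(dPhi1(n) - dPhi1(n-1))
def pid_beta2(beta2, dPhi1_n, dPhi1_prev, kappa1=0.01, kappa2=0.01):
    return beta2 + kappa1 * dPhi1_n + kappa2 * (dPhi1_n - dPhi1_prev)
      </preformat>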
      <p>The two problems described above should be studied in future work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Ch. M. Bishop</surname>
          </string-name>
          ,
          <source>Pattern Recognition and Machine Learning</source>
          , Springer Science + Business Media, LLC,
          <year>2006</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>Findeisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Szymanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wierzbicki</surname>
          </string-name>
          ,
          <article-title>Teoria i metody obliczeniowe optymalizacji</article-title>
          .
          <source>Panstwowe Wydawnictwo Naukowe</source>
          ,
          <year>Warszawa 1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Montana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <article-title>Training Feed Forward Neural Networks Using Genetic Algorithms</article-title>
          . IJCAI, Detroit, Michigan,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Osowski</surname>
          </string-name>
          ,
          <source>Sieci Neuronowe do Przetwarzania Informacji. Oficyna Wydawnicza Politechniki Warszawskiej</source>
          ,
          <year>Warsaw 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Osowski</surname>
          </string-name>
          ,
          <article-title>Sieci neuronowe w ujeciu algorytmicznym</article-title>
          .
          <source>WNT</source>
          ,Warszawa
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Toshinori</given-names>
            <surname>Munakata</surname>
          </string-name>
          ,
          <source>Fundamentals of the New Artificial Intelligence</source>
          .
          <source>Second Edition</source>
          , Springer
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Fyfe</surname>
          </string-name>
          ,
          <source>Artificial Neural Networks and Information Theory, Department of Computing and Information Systems</source>
          , The University of Paisley,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marciniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Korbicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kus</surname>
          </string-name>
          , Wstepne przetwarzanie danych,
          <source>Sieci Neuronowe</source>
          tom
          <volume>6</volume>
          , Akademicka Oficyna Wydawnicza EXIT,
          <year>Warsaw 2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mikrut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tadeusiewicz</surname>
          </string-name>
          ,
          <article-title>Sieci neuronowe w przetwarzaniu i rozpoznawaniu obrazow</article-title>
          ,
          <source>Sieci Neuronowe</source>
          tom
          <volume>6</volume>
          , Akademicka Oficyna Wydawnicza EXIT,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Rabunal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J</given-names>
            <surname>Dorado</surname>
          </string-name>
          ,
          <article-title>Artificial Neural Networks in Real-Life Applications</article-title>
          , Idea Group Publishing
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Placzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Adhikari</surname>
          </string-name>
          ,
          <article-title>Analysis of Multilayer Neural Networks with Direct and Cross-Forward Connection</article-title>
          , CS&amp;P Conference in the University of Warsaw, Warsaw 2013
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marciniak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Korbicz</surname>
          </string-name>
          , Neuronowe sieci modularne,
          <source>Sieci Neuronowe</source>
          tom
          <volume>6</volume>
          , Akademicka Oficyna Wydawnicza EXIT,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Zeng-Guang Hou</surname>
          </string-name>
          ,
          <string-name>
            <surname>Madan M. Gupta</surname>
          </string-name>
          ,
          <string-name>
            <surname>Peter N. Nikiforuk</surname>
          </string-name>
          , Min Tan, and Long Cheng,
          <article-title>A Recurrent Neural Network for Hierarchical Control of Interconnected Dynamic Systems</article-title>
          ,
          <source>IEEE Transactions on Neural Networks</source>
          , Vol.
          <volume>18</volume>
          , No. 2, March
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Rutkowski</surname>
          </string-name>
          ,
          <article-title>Metody i techniki sztucznej inteligencji</article-title>
          , Wydawnictwo Naukowe PWN,
          <year>Warsaw 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Mesarovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Macko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Takahara</surname>
          </string-name>
          ,
          <article-title>Theory of hierarchical multilevel systems</article-title>
          , Academic Press, New York and London,
          <year>1970</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>