
The Fourth International Workshop on Computer Modeling and Intelligent Systems, April 27, 2021

Long-Term Forecasting Method in the Supply Chain Based on an Artificial Neural Network with Multi-Agent Metaheuristic Training

Eugene Fedorov, Olga Nechyporenko

Cherkasy State Technological University, Shevchenko blvd., 460, Cherkasy, 18006, Ukraine

The problem of increasing the efficiency of long-term forecasting in the supply chain is examined. Neural network forecasting methods based on reservoir computing, which increase the forecast accuracy, are proposed. For these forecast models, parameter identification methods based on metaheuristics are proposed. The methods were studied on data from the logistics company Ekol Ukraine and are intended for intelligent computer-based supply chain management systems.

Keywords: long-term forecast, supply chain, metaheuristics, reservoir computing, forecast neural network model

1. Introduction

factors is possible; a complete enumeration of all possible models is not required; analysis of systems with heterogeneous factors is possible.

However, neural network methods suffer from a lack of transparency, the complexity of defining the architecture, strict requirements for the training sample, the complexity of choosing the training algorithm, and a resource-intensive training process. Therefore, increasing the efficiency of neural network forecasting is an urgent task.

The aim of the work is to develop a method for long-term forecasting in the supply chain. To achieve this goal, the following tasks were set and solved: analyze existing forecast methods; propose a neural network forecast model; choose a criterion for evaluating the effectiveness of a neural network forecast model; propose a method for determining the parameter values of the neural network forecast model based on multi-agent metaheuristics; perform numerical studies.

2. Problem statement

The problem of increasing the efficiency of long-term forecasting in the supply chain reduces to finding a parameter vector $W$ that satisfies the forecast model adequacy criterion

$$F = \frac{1}{P} \sum_{\mu=1}^{P} \left( f(x_\mu, W) - d_\mu \right)^2 \to \min_W,$$

i.e. delivers the minimum of the mean squared error (the difference between the model output and the desired output), where $P$ is the test set cardinality, $x_\mu$ is the $\mu$-th training input value, and $d_\mu$ is the $\mu$-th desired output value.
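As a small illustration of the criterion $F$, the following Python sketch evaluates the mean squared error for a candidate parameter vector. The linear function `linear_model` is only a hypothetical stand-in for the forecast model $f(x, W)$, and the data are synthetic.

```python
import numpy as np

def mse_criterion(model, W, xs, ds):
    """Mean squared error F between the model outputs f(x, W) and the desired outputs d."""
    outputs = np.array([model(x, W) for x in xs])
    return float(np.mean((outputs - np.asarray(ds)) ** 2))

# Hypothetical stand-in for the forecast model f(x, W): a linear map.
def linear_model(x, W):
    return W[0] * x + W[1]

xs = np.array([0.0, 1.0, 2.0, 3.0])
ds = np.array([1.0, 3.0, 5.0, 7.0])  # synthetic targets generated by d = 2x + 1

F_exact = mse_criterion(linear_model, np.array([2.0, 1.0]), xs, ds)  # parameters that fit exactly
F_off = mse_criterion(linear_model, np.array([2.0, 0.0]), xs, ds)    # bias wrong by 1 everywhere
```

A search method for $W$ only needs this scalar value for each candidate, which is what makes population-based metaheuristics applicable.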

3. Literature review

The most commonly used forecast neural networks are:

1. Long short-term memory (LSTM) [11, 12];

This network is based on gates (FIR filters) and a multilayer perceptron. Instead of each hidden neuron, it uses a memory block that contains one or more cells and is connected to input, output, and forget gates. The gates determine how much information passes through. If the input and output gates are close to 1 and the forget gate is close to 0, the network turns into an Elman network. If the input gate is close to 0, the short-term information from the input is ignored. If the forget gate is close to 0, the long-term information from the memory block is ignored. If the output gate is close to 0, the output information is ignored. The advantage of this network is a higher forecast accuracy than that of a conventional multilayer perceptron. The disadvantages are the complexity of implementing the memory blocks, insufficient forecast accuracy, the complexity of defining the architecture, and insufficient training speed.

2. Gated recurrent unit (GRU) [13-15];

This network is based on gates (FIR filters) and a multilayer perceptron. Instead of each hidden neuron, it uses a hidden block that is connected to reset and update gates. The gates determine how much information passes through. If the reset gate is close to 1 and the update gate is close to 0, the network turns into an Elman network. If the reset gate and the update gate are close to 0, the long-term information from the hidden block is ignored and the network becomes a multilayer perceptron. If the update gate is close to 1, the short-term information from the network input is ignored. The advantage of this network is a higher forecast accuracy than that of a conventional multilayer perceptron. The disadvantages are the complexity of implementing the hidden blocks, insufficient forecast accuracy, the complexity of defining the architecture, and insufficient training speed.

3. Neural Turing machine (NTM) [16, 17];

This network is based on a Turing machine and a multilayer perceptron or LSTM and includes a controller and a memory matrix. At any given time, the controller receives input from the outside world and sends output to the outside world. The controller also reads from the memory matrix cells via the read heads and writes to the memory matrix cells via the write heads. The advantage of this network is a higher forecast accuracy than that of a conventional multilayer perceptron. The disadvantages are the complexity of implementing the controller (in the case of LSTM), the complexity of defining the architecture, insufficient forecast accuracy, and insufficient training speed.

4. Echo state network (ESN) [18, 19];

This network is based on reservoir computing over sigmoid neurons and a multilayer perceptron. The hidden layer is called the reservoir. Each neuron in the reservoir may be unconnected or connected to other neurons in the reservoir. The network is trained with the pseudoinverse matrix method. The advantages of this network are the highest forecast accuracy among the networks listed here (due to the pseudoinverse matrix method) and the ease of implementing sigmoid neurons in the reservoir. The disadvantages are the complexity of parallel training and the complexity of defining the architecture.

5. Liquid state machine (LSM) [20-23].

This network is based on reservoir computing over spiking (impulse) leaky integrate-and-fire (LIF) neurons and a multilayer perceptron. Each neuron in the reservoir may be unconnected or connected to other neurons in the reservoir and is either excitatory or inhibitory. A gradient learning method is used to train the network. The advantages of this network are a higher forecast accuracy than that of a conventional multilayer perceptron and the possibility of parallel training for the part of the network that corresponds to the multilayer perceptron. The disadvantages are the complexity of implementing the impulse neurons, the complexity of defining the architecture, a lower forecast accuracy, and the complexity of parallel training for the part of the network that corresponds to the reservoir.
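To make the pseudoinverse training mentioned for the echo state network (item 4) concrete, here is a minimal Python sketch of a generic ESN with a least-squares readout. It is not the network studied in this paper: the reservoir size, the weight ranges, the spectral radius of 0.9, the washout length, and the sine test series are all illustrative assumptions.

```python
import numpy as np

def run_reservoir(u, W_in, W_res):
    """Collect reservoir states x(n) = tanh(W_in * u(n) + W_res @ x(n-1))."""
    x = np.zeros(W_res.shape[0])
    states = []
    for u_n in u:
        x = np.tanh(W_in * u_n + W_res @ x)
        states.append(x.copy())
    return np.array(states)

rng = np.random.default_rng(0)
n_res = 30                                           # assumed reservoir size
W_in = rng.uniform(-0.5, 0.5, n_res)                 # fixed random input weights
W_res = rng.uniform(-0.5, 0.5, (n_res, n_res))       # fixed random reservoir weights
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))  # rescale to spectral radius 0.9

series = np.sin(0.2 * np.arange(300))                # synthetic series
u, d = series[:-1], series[1:]                       # one-step-ahead targets

X = run_reservoir(u, W_in, W_res)
X = np.hstack([X, np.ones((len(X), 1))])             # append a bias column
washout = 50                                         # discard the initial transient
W_out = np.linalg.pinv(X[washout:]) @ d[washout:]    # pseudoinverse (least-squares) readout
mse = float(np.mean((X[washout:] @ W_out - d[washout:]) ** 2))
```

Only `W_out` is trained; the reservoir stays fixed. The readout is obtained from one matrix factorization rather than from independent per-sample updates, which is one way to see why this training step is harder to split across workers.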

The methods listed above usually suffer from at least one of the following: low forecast accuracy (due to falling into a local extremum), low training speed (due to the high computational complexity of the hidden neuron or the difficulty of parallelizing training), complex implementation (due to the complexity of the hidden neuron architecture), or a hard-to-define architecture. Each of these reduces forecast efficiency.

Therefore, creating a neural network whose training method and architecture eliminate the indicated disadvantages is an urgent task.

4. Block diagram of a neural network model for a long‐term forecast

Figure 1: Block diagram of a long-term forecast model based on a fully connected echo state network with a cascade of unit delays for an input layer neuron (FC-ESN type 1)

5. Neural network models for long-term forecast

5.1. Long-term forecast model FC-ESN type 1

1. Initialization

$$y_i^{(1)}(n-1) = 0, \quad i \in \overline{1, N^{(1)}}.$$

2. Forecast

2.1. Initialization of the outputs of the neurons of the input layer

$$y_i^{(0)}(n) = x_i,$$

2.2. Calculation of the outputs of the neurons of the hidden layer

$$y_j^{(1)}(n) = f^{(1)}(s_j^{(1)}(n)), \quad j \in \overline{1, N^{(1)}},$$

$$s_j^{(1)}(n) = b_j^{(1)}(n) + \sum_{i=0}^{M^{(0)}} w_{ij}^{(1)} y^{(0)}(n-i) + \sum_{i=M^{(0)}+1}^{M^{(0)}+N^{(1)}} w_{ij}^{(1)} y_{i-M^{(0)}}^{(1)}(n-1),$$

2.3. Calculation of the outputs of the neurons of the output layer

$$y^{(2)}(n) = f^{(2)}(s^{(2)}(n)),$$

$$s^{(2)}(n) = b^{(2)}(n) + \sum_{i=0}^{M^{(0)}} w_i^{(2)} y^{(0)}(n-i) + \dots,$$

where $N^{(1)}$ is the number of neurons in the first layer, $M^{(k)}$ is the number of unit delays for the $k$th layer, $w_{ij}^{(k)}$ is the connection weight from the $i$th neuron to the $j$th neuron on the $k$th layer, $b_j^{(k)}$ is the bias (threshold) on the $k$th layer, $y_j^{(k)}(n)$ is the output of the $j$th neuron on the $k$th layer at time $n$, and $f^{(k)}$ is the neuron activation function on the $k$th layer (usually $f^{(k)}(s) = \tanh(s)$).

5.2. Long-term forecast model FC-ESN type 2

1. Initialization

$$y_i^{(1)}(n-1) = 0, \quad i \in \overline{1, N^{(1)}}.$$

2. Forecast

In the type 2 model, the weighted sum of a hidden neuron additionally includes the delayed outputs of the output layer:

$$s_j^{(1)}(n) = b_j^{(1)}(n) + \sum_{i=0}^{M^{(0)}} w_{ij}^{(1)} y^{(0)}(n-i) + \sum_{i=M^{(0)}+1}^{M^{(0)}+N^{(1)}} w_{ij}^{(1)} y_{i-M^{(0)}}^{(1)}(n-1) + \sum_{i=M^{(0)}+N^{(1)}+1}^{M^{(0)}+N^{(1)}+M^{(2)}} w_{ij}^{(1)} y^{(2)}\big(n - (i - (M^{(0)} + N^{(1)}))\big).$$

6. Criterion for evaluating the effectiveness of a neural network model for long-term forecast

In this work, the parameter values of the FC-ESN model are determined using the model adequacy criterion, which means choosing the values of the parameters $W = \{w_{ij}^{(1)}, w_i^{(2)}\}$ that deliver the minimum of the mean squared error (the difference between the model output and the desired output):

$$F = \frac{1}{P} \sum_{\mu=1}^{P} \left( y_\mu^{(2)} - d_\mu \right)^2 \to \min_W, \quad (1)$$

where $P$ is the test set cardinality.

7. Method for determining the parameter values of the neural network model for long-term forecast

The method for determining the parameter values of the neural network model for long-term forecasting reduces to calculating the weights of the hidden layer and of the output layer of the FC-ESN model.
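The split between the two weight groups can be sketched in Python as follows. This is a minimal illustration under several assumptions, not the authors' implementation: the hidden-layer weights are drawn uniformly from [-1, 1] and their recurrent block is rescaled to a chosen spectral radius (the procedure of Section 7.1), the output layer is assumed linear, and the output layer is assumed to see the bias, the delayed inputs, and the hidden outputs. The names `make_hidden_weights`, `hidden_states`, and `criterion` are hypothetical.

```python
import numpy as np

def make_hidden_weights(n_hidden, n_delays, rho, rng):
    """Random hidden-layer weights; the recurrent block is rescaled to spectral radius rho."""
    W_in = rng.uniform(-1, 1, (n_delays + 1, n_hidden))   # weights for y0(n), ..., y0(n - M0)
    W_rec = rng.uniform(-1, 1, (n_hidden, n_hidden))      # hidden-to-hidden weights
    W_rec *= rho / np.max(np.abs(np.linalg.eigvals(W_rec)))
    b = rng.uniform(-1, 1, n_hidden)
    return W_in, W_rec, b

def hidden_states(x, W_in, W_rec, b):
    """Feature rows [1, delayed inputs, hidden outputs] for every n with a full input window."""
    n_delays = W_in.shape[0] - 1
    y1 = np.zeros(W_rec.shape[0])                 # y1(n-1) = 0 at the start
    rows = []
    for n in range(n_delays, len(x)):
        window = x[n - n_delays:n + 1][::-1]      # y0(n), y0(n-1), ..., y0(n-M0)
        y1 = np.tanh(b + window @ W_in + y1 @ W_rec)
        rows.append(np.concatenate(([1.0], window, y1)))
    return np.array(rows)

def criterion(w_out, features, d):
    """Mean squared error F for an output layer assumed to be linear."""
    return float(np.mean((features @ w_out - d) ** 2))

rng = np.random.default_rng(1)
M0, N1 = 3, 16                                    # illustrative delay count; 16 hidden neurons
W_in, W_rec, b = make_hidden_weights(N1, M0, rho=0.9, rng=rng)
x = np.sin(0.3 * np.arange(200))                  # synthetic series
features = hidden_states(x[:-1], W_in, W_rec, b)
d = x[M0 + 1:]                                    # next value after each input window
F0 = criterion(np.zeros(features.shape[1]), features, d)
```

Once the hidden layer is fixed, `criterion` depends only on the output weight vector, so any optimizer that works with a scalar objective can be applied to it.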

7.1. Calculating the weights of the hidden layer

The weights of the hidden layer are calculated as follows:

1. Randomly initialize the biases (thresholds) $b_j^{(1)}$ and the weights $w_{ij}^{(1)}$.

2. Form the matrix $W = [w_{ij}]$, $i, j \in \overline{1, N^{(1)}}$, from the weights $w_{ij}^{(1)}$, $i \in \overline{M^{(0)}+1, M^{(0)}+N^{(1)}}$, $j \in \overline{1, N^{(1)}}$.

3. Determine the matrix $\tilde{W}$ as

$$\tilde{W} = \frac{\rho W}{\max_{j \in \overline{1, N^{(1)}}} |\lambda_j|},$$

where $\rho$ is the spectral radius of the matrix $\tilde{W}$ (for large $\rho$ learning is faster, but long short-term memory decreases), $0 < \rho < 1$, and $\lambda_j$ are the eigenvalues of the matrix $W$.

4. Assign to the weights $w_{ij}^{(1)}(n)$, $i \in \overline{M^{(0)}+1, M^{(0)}+N^{(1)}}$, $j \in \overline{1, N^{(1)}}$, the values of the corresponding elements of the matrix $\tilde{W}$.

7.2. The output layer weights calculation based on the multi-agent metaheuristic SAPSO method

The proposed SAPSO (simulated annealing and particle swarm optimization) method for numerical functions optimization consists of the following blocks (Figure 3).

1. Initialization
2. Modification of the speed of each particle using simulated annealing
3. Modification of the position of each particle
4. Determination of the particle of the current population with the best position
5. Determining the global best position
6. If n < N, return to block 2 (yes); otherwise output x* (no)

Figure 3: The sequence of procedures of the optimization method based on the multi-agent metaheuristic SAPSO method

Block 1 - Initialization:

• setting the maximum number of iterations $N$;

• setting the size of the swarm $K$ (usually no more than 40);

• setting the dimension of the particle position $M$ (corresponds to the number of weights in the output layer);

• setting the number of the current iteration $n$ to one;

• initialization of the position $x_k$ (corresponds to the solution, i.e. the vector of the weights of the output layer)

$$x_k = (x_{k1}, \dots, x_{kM}), \quad x_{kj} = (x_j^{max} - x_j^{min}) U(0,1) + x_j^{min}, \quad k \in \overline{1, K},$$

where $U(0,1)$ is a function that returns a uniformly distributed random number in the range $[0,1]$, and $x_j^{min}$, $x_j^{max}$ are the minimum and maximum values;

• initialization of the personal (local) best position $x_k^{best}$

$$x_k^{best} = x_k, \quad k \in \overline{1, K};$$

• speed initialization $v_k$

$$v_k = (v_{k1}, \dots, v_{kM}), \quad v_{kj} = 0, \quad k \in \overline{1, K};$$

• creating an initial particle swarm

$$Q = \{(x_k, x_k^{best}, v_k)\};$$

• determination of the particle of the current population with the best position

$$k^* = \arg\min_{k \in \overline{1, K}} F(x_k), \quad x^* = x_{k^*}.$$

Block 2 - Modification of the speed of each particle using simulated annealing

Block 2.1 - Calculating two vectors of random numbers for each particle

$$r1_k = (r1_{k1}, \dots, r1_{kM}), \quad r1_{kj} \in \{U(0,1), C(0,1), N(0,1)\}, \quad k \in \overline{1, K}, \; j \in \overline{1, M},$$

$$r2_k = (r2_{k1}, \dots, r2_{kM}), \quad r2_{kj} \in \{U(0,1), C(0,1), N(0,1)\}, \quad k \in \overline{1, K}, \; j \in \overline{1, M},$$

where $N(0,1)$ is a function that returns a random number from a standard normal distribution, and $C(0,1)$ is a function that returns a random number from a standard Cauchy distribution.

Block 2.2 - Calculating the annealing temperature

$$T(n) = \beta T(n-1), \quad T(0) = T_0, \quad \beta = N^{-\frac{1}{N-1}}, \quad T_0 = N^{\frac{N}{N-1}},$$

where $T(n)$ is the annealing temperature at iteration $n$, $T_0$ is the initial annealing temperature, and $\beta$ is the parameter controlling the annealing temperature.

Block 2.3 - Calculating the parameters controlling the contribution of the components

$$\alpha_1(n) = \alpha_2(n) = \alpha(0) \exp(-1/T(n)), \quad w(n) = w(0) \exp(-1/T(n)),$$

$$\alpha(0) = \alpha_0 = 0.5 + \ln 2, \quad w(0) = w_0 = \frac{1}{2 \ln 2},$$

where $\alpha_1(n)$ is the parameter controlling the contribution of the component $(x_k^{best} - x_k)(r1)^T$ to the particle velocity at iteration $n$, $\alpha_2(n)$ is the parameter controlling the contribution of the component $(x^* - x_k)(r2)^T$ to the particle velocity at iteration $n$, $w(n)$ is the parameter controlling the contribution of the particle velocity at iteration $n-1$ to the particle velocity at iteration $n$, $\alpha_0$ is the initial value of the parameters $\alpha_1(n)$ and $\alpha_2(n)$, and $w_0$ is the initial value of the parameter $w(n)$.

The simulated annealing introduced in this work establishes an inverse relationship between the parameters $\alpha_1(n)$, $\alpha_2(n)$, $w(n)$ and the iteration number: at the initial iterations the entire search space is explored (in this case, the Cauchy distribution is used), and at the final iterations the search becomes directional (in this case, the normal distribution is used). In addition, a direct relationship was established between the parameters $T_0$ and $\beta$ and the number of iterations, which makes it possible to automate the selection of these parameters.

The choice of the initial values $\alpha_0 = 0.5 + \ln 2$ and $w_0 = \frac{1}{2 \ln 2}$ is standard and satisfies the conditions for particle swarm convergence, including $w_0 < 1$.
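The decay of the coefficients can be checked numerically with a short Python sketch. The constants $\beta = N^{-1/(N-1)}$ and $T_0 = N^{N/(N-1)}$ used here are an assumption about the schedule; under that assumption the temperature falls from $N$ at the first iteration to 1 at the last one. The function name `sapso_schedule` is hypothetical.

```python
import math

def sapso_schedule(N, alpha0=0.5 + math.log(2), w0=1.0 / (2.0 * math.log(2))):
    """Return the lists T(n), alpha(n), w(n) for n = 1..N under geometric cooling."""
    beta = N ** (-1.0 / (N - 1))       # assumed cooling factor
    T = N ** (N / (N - 1))             # assumed T0, chosen so that T(1) = N and T(N) = 1
    temps, alphas, ws = [], [], []
    for _ in range(N):
        T *= beta                      # T(n) = beta * T(n-1)
        damp = math.exp(-1.0 / T)      # shrinks as the temperature falls
        temps.append(T)
        alphas.append(alpha0 * damp)   # alpha_1(n) = alpha_2(n)
        ws.append(w0 * damp)           # inertia weight w(n)
    return temps, alphas, ws

temps, alphas, ws = sapso_schedule(N=100)
```

With these constants the damping factor moves from roughly $e^{-1/N}$ at the start to $e^{-1}$ at the end, so the inertia weight stays below 1 throughout and the attraction coefficients shrink as the search proceeds.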

Block 2.4 - Calculating the speed of each particle

$$v_k = w(n) v_k + \alpha_1(n) (x_k^{best} - x_k)(r1)^T + \alpha_2(n) (x^* - x_k)(r2)^T, \quad k \in \overline{1, K}.$$

Block 3 - Modification of the position of each particle

Block 3.1 - Limiting the speed of each particle

$$v_{kj} = \begin{cases} v_{kj}, & v_{kj} \in (x_j^{min}, x_j^{max}) \\ 0, & v_{kj} \notin (x_j^{min}, x_j^{max}) \end{cases}, \quad k \in \overline{1, K}, \; j \in \overline{1, M}.$$

Block 3.2 - Calculating the position of each particle

$$x_k = x_k + v_k, \quad k \in \overline{1, K},$$

$$x_{kj} = \begin{cases} x_{kj}, & x_{kj} \in (x_j^{min}, x_j^{max}) \\ x_j^{max}, & x_{kj} \ge x_j^{max} \\ x_j^{min}, & x_{kj} \le x_j^{min} \end{cases}, \quad k \in \overline{1, K}, \; j \in \overline{1, M}.$$

Block 4 - Determination of the personal (local) best position of each particle

If $F(x_k) < F(x_k^{best})$, then $x_k^{best} = x_k$, $k \in \overline{1, K}$.

Block 5 - Determination of the particle of the current population with the best position

$$k^* = \arg\min_{k \in \overline{1, K}} F(x_k).$$

Block 6 - Determining the global best position

If $F(x_{k^*}) < F(x^*)$, then $x^* = x_{k^*}$.

Block 7 - Stop condition

If $n < N$, then increase the iteration number $n$ by one and go to block 2. Otherwise, the result is $x^*$.
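The blocks above can be put together in a compact Python sketch. It is an illustration under stated assumptions rather than the authors' Matlab implementation: random vectors are drawn from U(0,1) only, the products with the random vectors are taken elementwise, the cooling constants are the assumed $\beta = N^{-1/(N-1)}$ and $T_0 = N^{N/(N-1)}$, and the velocity limit zeroes any component outside $(x_j^{min}, x_j^{max})$. The function name `sapso` and the sphere test function are hypothetical.

```python
import math
import numpy as np

def sapso(F, x_min, x_max, K=40, N=200, seed=0):
    """Minimize F over the box [x_min, x_max] with PSO whose coefficients are annealed."""
    rng = np.random.default_rng(seed)
    x_min, x_max = np.asarray(x_min, float), np.asarray(x_max, float)
    M = x_min.size
    # Block 1: positions, zero velocities, personal bests, initial global best
    x = x_min + (x_max - x_min) * rng.uniform(size=(K, M))
    v = np.zeros((K, M))
    x_best = x.copy()
    f_best = np.array([F(p) for p in x])
    k_star = int(np.argmin(f_best))
    g, f_g = x_best[k_star].copy(), float(f_best[k_star])
    alpha0, w0 = 0.5 + math.log(2), 1.0 / (2.0 * math.log(2))
    beta, T = N ** (-1.0 / (N - 1)), N ** (N / (N - 1))   # assumed schedule constants
    for n in range(1, N + 1):
        # Block 2: random vectors, temperature, coefficients, velocity
        r1, r2 = rng.uniform(size=(K, M)), rng.uniform(size=(K, M))
        T *= beta
        damp = math.exp(-1.0 / T)
        v = w0 * damp * v + alpha0 * damp * (x_best - x) * r1 + alpha0 * damp * (g - x) * r2
        # Block 3: zero out-of-range velocity components, move, clamp positions to the box
        v = np.where((v > x_min) & (v < x_max), v, 0.0)
        x = np.clip(x + v, x_min, x_max)
        # Blocks 4-6: personal bests, best particle of the population, global best
        f = np.array([F(p) for p in x])
        improved = f < f_best
        x_best[improved], f_best[improved] = x[improved], f[improved]
        k_star = int(np.argmin(f))
        if f[k_star] < f_g:
            g, f_g = x[k_star].copy(), float(f[k_star])
    return g, f_g

def sphere(p):
    """Hypothetical test objective with its minimum at (1, 1, 1)."""
    return float(np.sum((p - 1.0) ** 2))

g, f_g = sapso(sphere, [-5.0] * 3, [5.0] * 3)
```

For the forecasting task, `F` would be the criterion (1) evaluated on the output-layer weight vector, and `x_min`, `x_max` would bound each weight.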

8. Experiments and results

The determination of the neural network model parameter values was modeled in the Matlab package using Parallel Computing Toolbox. The formation of each particle in block 1 and the modification of the speed, position, and local best position of each particle in blocks 2-4, respectively, occur independently of other particles, and the order in which particles are formed and modified is arbitrary. It is therefore proposed to process the particles in parallel using a parfor loop. Parfor is part of Parallel Computing Toolbox and replaces the sequential for loop; it resembles an OpenMP parallel loop but, unlike OpenMP, can be used not only on a local multicore machine but also on a cluster. The advantage of this approach over the CUDA and MPI technologies (the latter represented in Parallel Computing Toolbox by the spmd block) is the simplicity and clarity of the technical implementation. Because the number of particles is small, each particle can be formed and modified on its own physical core of the processors of the machines combined into a cluster.
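The independence of the particles can be illustrated outside Matlab as well. The Python sketch below is only a structural analogue of a parfor loop, not the implementation used in this work: it maps the criterion over the particles with a thread pool from the standard library. For CPU-bound criteria in Python a process pool or a cluster scheduler would be needed to obtain a real speedup; the thread pool is used here only to keep the example short. The names `evaluate_swarm` and `sphere` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def evaluate_swarm(F, positions, max_workers=4):
    """Evaluate the criterion F for every particle; particles do not depend on each other."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return np.array(list(pool.map(F, positions)))   # results keep the particle order

def sphere(p):
    """Hypothetical criterion used only for the demonstration."""
    return float(np.sum(p ** 2))

rng = np.random.default_rng(0)
positions = rng.uniform(-1.0, 1.0, size=(40, 5))        # K = 40 particles, M = 5
parallel = evaluate_swarm(sphere, positions)
sequential = np.array([sphere(p) for p in positions])
```

Because each particle is evaluated without reading any other particle's state, the parallel and sequential results coincide; only the selection of the best particle requires a reduction over the whole swarm.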

The swarm size was set to K = 40.

To determine the type of distribution used in the SAPSO method, a number of experiments were carried out, the results of which are presented in Table 1.

Table 1
Comparative characteristics of distribution types

Distribution type | Criterion: number of iterations
U(0,1) |

According to Table 1, the distribution U(0,1) requires the fewest iterations while maintaining the required forecast accuracy.

To define the structure of a long-term forecast model based on FC-ESN, i.e. to determine the number of hidden neurons, a number of experiments were carried out, the results of which are presented in Figure 4.

A sample of values based on data from the logistics company Ekol Ukraine was used as input data to determine the parameter values of the neural network model for the long-term forecast. The criterion for choosing the structure of the neural network model was the minimum mean squared forecast error. As can be seen from Figure 4, the error value decreases as the number of hidden neurons increases. It is sufficient to use 16 neurons in the hidden layer for the forecast, since a further increase in the number of hidden neurons changes the error value only insignificantly.

The neural networks for long-term forecasting were investigated in the work according to the criterion of the minimum mean squared error (MSE) of the forecast and computational complexity (Table 2), where M (k) – the number of unit delays for the kth layer, S – the number of cell, N ( 1 ) – the number of neurons in the first layer, P – training set cardinality, N – number of iterations of the multi-agent metaheuristic method SAPSO, N << P , N ( 1 )  P .

According to Table 2, FC-ESN type 2 has the highest forecast accuracy, and FC-ESN type 1 has the lowest computational complexity.

Table 2
Comparative characteristics of neural networks for long-term forecast

Network | Minimum MSE of the forecast | Computational complexity
Full LSTM | 0.12 |
GRU | |

Based on the performed experiments, the following conclusions can be drawn.

The LSTM network has average training speed and forecast accuracy.

The GRU network is second only to the authors' networks in training speed (it uses a gradient learning method and has a lower computational complexity than LSTM and ESN). However, it has the lowest forecast accuracy (due to the gradient learning method and an architecture that is simplified compared to LSTM).

ESN networks are inferior in forecast accuracy only to the authors' networks, since they are trained with the pseudoinverse matrix method. However, they have the lowest training speed (they have the highest computational complexity, and the pseudoinverse matrix method does not lend itself to parallelism).

The authors' FC-ESN networks are trained with the proposed metaheuristic, which increases the forecast accuracy (low probability of falling into a local extremum) and the training speed (it supports parallel training), and they are not complex to implement.

9. Conclusions

The article discusses the problem of improving the efficiency of long-term forecasting in the supply chain. To solve this problem, the existing forecasting methods were investigated. These studies showed that artificial neural networks are by far the most effective. To improve the quality of the long-term forecast, an ESN neural network was chosen and modified (by introducing full connectivity and cascades of unit delays in the input and output layers), and the structure of its model was determined in the course of a numerical study. The experiments showed that beyond 16 hidden neurons the value of the mean squared error does not change significantly, and the selected network gives forecast results with a minimum deviation. A method was proposed for determining the parameter values of the proposed neural network model for long-term forecast, which made it possible to ensure high speed and accuracy of the forecast. The proposed methods are intended for software implementation in the Matlab package using Parallel Computing Toolbox, which speeds up the process of finding a solution. The software implementing the proposed methods was developed and studied on the database of the logistics company Ekol Ukraine. The experiments confirmed the efficiency of the developed software, which allows it to be recommended for practical use in solving supply chain management problems. Prospects for further research lie in applying the proposed methods to a wider set of benchmarks.

10. References

[1] J. F. Cox, J. G. Schleher, Theory of constraints handbook, McGraw-Hill, New York, NY, 2010.
[2] E. M. Goldratt, My saga to improve production, in: Selected Readings in Constraints Management, APICS, Falls Church, VA, 1996, pp. 43-48.
[3] E. M. Goldratt, Production: The TOC way, including CD-ROM simulator and workbook, Revised edition, North River Press, Great Barrington, MA, 2003.
[4] S. Smerichevska et al., Cluster Policy of Innovative Development of the National Economy: Integration and Infrastructure Aspects: monograph, S. Smerichevska (Ed.), Wydawnictwo naukowe WSPIA, 2020.
[5] R. T. Baillie, Kapetanios, Papailias, Modified information criteria and selection of long memory time series models, in: Computational Statistics and Data Analysis, volume 76, 2014, pp. 116-131. doi: 10.1016/j.csda.2013.04.012.
[6] P. Bidyuk, T. Prosyankina-Zharova, O. Terentiev, Modelling nonlinear nonstationary processes in macroeconomy and finances, in: Z. Hu, S. Petoukhov, I. Dychka, M. He (Eds.), Advances in Computer Science for Engineering and Education. Advances in Intelligent Systems and Computing, volume 754, Springer, Cham, 2019, pp. 735-745. doi: 10.1007/978-3-319-91008-6_72.
[7] L. Lyubchyk, E. Bodyansky, A. Rivtis, Adaptive harmonic components detection and forecasting in wave non-periodic time series using neural networks, in: Proceedings of the ISCDMCI'2002, Evpatoria, 2002, pp. 433-435.
[8] K.-L. Du, K. M. S. Swamy, Neural networks and statistical learning, Springer-Verlag, London, 2014.
[9] S. Haykin, Neural networks, Pearson Education, New York, NY, 1999.
[10] S. N. Sivanandam, S. Sumathi, S. N. Deepa, Introduction to neural networks using Matlab 6.0, The McGraw-Hill Comp., Inc., New Delhi, 2006.
[11] S. Hochreiter, J. Schmidhuber, Long short-term memory, in: Neural Computation, volume 9, 1997, pp. 1735-1780. doi: 10.1162/neco.1997.9.8.1735.
[12] F. Gers, Long Short-Term Memory in Recurrent Neural Networks, PhD thesis, Ecole Polytechnique Federale de Lausanne.
[13] K. Cho, B. van Merrienboer, C. Gulcehre, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014, pp. 1724-1734. doi: 10.3115/v1/D14-1179.
[14] R. Dey, F. M. Salem, Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks, arXiv:1701.05923, 2017. URL: https://arxiv.org/ftp/arxiv/papers/1701/1701.05923.pdf.
[15] E. Fedorov, T. Utkina, O. Nechyporenko, Forecast method for natural language constructions based on a modified gated recursive block, in: CEUR Workshop Proceedings, volume 2604, 2020, pp. 199-214.
[16] A. Graves, G. Wayne, M. Reynolds et al., Hybrid computing using a neural network with dynamic external memory, Nature 538 (2016) 471-476. doi: 10.1038/nature20101.
[17] R. B. Greve, E. J. Jacobsen, S. Risi, Evolving neural turing machines for reward-based learning, in: Proceedings of the 2016 Genetic and Evolutionary Computation Conference, GECCO’16, ACM, 2016, pp. 117-124. doi: 10.1145/2908812.2908930.
[18] H. Jaeger, Tutorial on Training Recurrent Neural Networks, Covering BPPT, RTRL, EKF and the Echo State Network Approach, GMD Report 159, German National Research Center for Information Technology, 2002.
[19] H. Jaeger, M. Lukosevicius, D. Popovici, U. Siewert, Optimization and applications of echo state networks with leaky-integrator neurons, in: Neural Networks, volume 20, 2007, pp. 335-352. doi: 10.1016/j.neunet.2007.04.016.
[20] T. Natschläger, W. Maass, H. Markram, The liquid computer: A novel strategy for real-time computing on time series, in: Special Issue on Foundations of Information Processing of Telematik, 2002, pp. 39-43.
[21] Q. Wang, P. Li, D-lsm: Deep liquid state machine with unsupervised recurrent reservoir tuning, in: 23rd International Conference on Pattern Recognition (ICPR), IEEE, Cancun, 2016, pp. 2653-2658. doi: 10.1109/ICPR.2016.7900035.
[22] W. Maass, Liquid state machines: motivation, theory, and applications, in: Computability in context: computation and logic in the real world, 2011, pp. 275-296. doi: 10.1142/9781848162778_0008.
[23] T. Neskorodieva, E. Fedorov, I. Izonin, Forecast method for audit data analysis by modified liquid state machine, in: CEUR Workshop Proceedings, volume 2631, 2020, pp. 145-158.