Comparison of Methods for Improving the Quality of Prediction Using Artificial Neural Networks

Kirill Uryvaev[0000-0001-5275-5024], spore9@yandex.ru
Alena Rusak[0000-0002-2803-4777], alena@cde.ifmo.ru
ITMO University, 49 Kronverksky Pr., St. Petersburg, 197101, Russia

Abstract. In this paper, methods for improving the quality of forecasting with artificial neural networks are studied. As more and more systems that rely on prediction algorithms such as artificial neural networks become automated, the cost of a forecasting error grows, especially in expensive or dangerous areas. In problems with a small set of initial data or a large algorithmic complexity, solving the forecasting problem can give unsatisfactory results or require large amounts of computation. The speed and accuracy of prediction are therefore of critical importance, since in many practical problems the cost of forecasting errors is extremely high. There are a number of methods aimed at improving the quality of training of artificial neural networks. The methods selected for this paper do not exclude each other, so they can be combined without conflict; however, some of them can worsen the result in certain situations, and some can increase accuracy while decreasing speed, or vice versa. For this reason the methods should first be compared separately and only then combined, and the combination should improve the result, because the disadvantages of each method are compensated by the advantages of the others. The experiment showed that although individually these methods do not have a significant impact on accuracy, their combination improves the quality of forecasting.

Keywords: Artificial neural networks · Machine learning · Forecasting · Nonlinear regression · Extrapolation.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Prediction task

Prediction is a non-linear and often quite difficult task, which is clearly relevant nowadays. It finds its application both in purely scientific fields (physics, chemistry, biology, etc.) and in practice (marketing, statistics, etc.). The essence of the forecasting problem is to predict the future reaction of a system from its previous behavior. Many approaches have been developed to solve this problem, among them artificial neural networks (ANN). The main advantage of neural network models is their nonlinearity, i.e. the ability to establish non-linear relationships between future and current process values, as well as their adaptability and scalability. The disadvantages of neural network models are the opacity of the modeling process, the complexity of choosing an architecture and the complexity of training an artificial neural network.

The accuracy of forecasting is largely influenced by a careful analysis of the input data, the selection of features, and the ability to take into account external factors, i.e. other processes occurring in parallel with the studied process and affecting it. In problems with a small set of initial data or a large algorithmic complexity, solving the prediction problem can give unsatisfactory results or require large amounts of computation. At the same time, given the ongoing automation, in systems whose functionality covers important and dangerous areas the price of a forecasting error is of critical importance.
One approach to improving forecast accuracy is the use of combined models, for example an ANN in combination with autoregressive models [1]. Another way to increase accuracy is to use a consensus forecast, i.e. a forecast that is a linear combination of several independent forecasts [2]. However, these approaches are rather laborious, because they require the development of several forecasting models. In this work, we study the influence on forecast accuracy of some standard methods used in the training of neural networks.

For an ANN, forecasting is a nonlinear regression task. Having information about the values of the variable x at the moments preceding the prediction instant, x(k-1), x(k-2), ..., x(k-N), the network decides what the most probable value of the sequence at the current moment k will be. To adapt the network weights, the actual forecasting error ε(k) = x(k) − x̄(k) and the values of this error at previous time instants are used [3].

Despite their advantages, neural networks also have drawbacks. To solve a problem successfully, the neural network needs to be trained to an acceptable level, which requires sufficient data, time and computing resources. Even under such conditions training may go wrong; for example, the network may overfit. This means that the network adapts too closely to the answers in the training set, and all other values are very likely to be predicted incorrectly. However, there are methods that can help to mitigate these problems. Although there are many methods for improving the results of ANNs, not all of them are universal, and what works well, for example, for classifying images may not work at all for predicting time series. Therefore, it is necessary first to consider the effectiveness of these methods in general, and then for the forecasting problem.

2 Methods for improving prediction quality

The learning outcome is strongly influenced by the selection of the initial values of the network weights. A wrong choice of the range of random weight values can lead to an excessive slowdown of the learning process. Xavier initialization (also known as Glorot initialization) generates the random initial connection weights based on the number of input and output links of a given neuron. This method accelerates the training of an ANN [4] and improves the quality of prediction because, over several training runs, the ANN does not fall into the same local minimum. The initial connection weight is calculated by (1):

W_i \sim U\left(-\frac{\sqrt{6}}{\sqrt{Count_{in} + Count_{out}}},\; \frac{\sqrt{6}}{\sqrt{Count_{in} + Count_{out}}}\right)    (1)

where W_i is the neuron connection weight, U is the uniform distribution, Count_in is the number of neuron inputs, and Count_out is the number of neuron outputs.
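As an illustration, the following minimal sketch in Python (assuming the Keras/TensorFlow stack used later in this paper; the layer size of 64 is an illustrative assumption) shows how the Glorot uniform rule (1) can be requested for a layer, and how the same rule looks when written out directly with NumPy.

    import numpy as np
    import tensorflow as tf

    # A Keras layer whose weights are drawn according to (1) (Glorot/Xavier uniform).
    layer = tf.keras.layers.Dense(
        64, activation="relu",
        kernel_initializer=tf.keras.initializers.GlorotUniform(),
    )

    # The same rule written out explicitly for a hypothetical fully connected
    # layer with count_in inputs and count_out outputs.
    def glorot_uniform(count_in: int, count_out: int) -> np.ndarray:
        limit = np.sqrt(6.0) / np.sqrt(count_in + count_out)
        return np.random.uniform(-limit, limit, size=(count_in, count_out))

    weights = glorot_uniform(count_in=14, count_out=64)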
Data shuffling is a random rearrangement of the data in the training and test samples. This method avoids the dependence of the result on the order of the data. However, this technique is not always suitable for forecasting problems [5], since in such problems the data are often order-dependent.

To prevent overfitting when building a neural network, various regularization methods are used. One approach is to introduce an additional term (regularizer) into the objective function, which does not allow the weights to acquire very large values. The regularizer may take the following form (2):

regularization = \frac{\lambda}{2m}\sum_{j=1}^{n} W_j^2    (2)

where m is the sample size, λ is the regularization coefficient, n is the number of neuron connections, and W_j is the weight of a connection. In the theory of neural networks such regularization is called weight decay, because it leads to a decrease in the absolute values of the weights.

Another regularization method is called dropout. Its key idea is to randomly drop units (along with their connections) from the neural network during training, which prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks; at test time, the effect of averaging the predictions of all these thinned networks is approximated by simply using a single unthinned network with smaller weights. According to [6], this significantly reduces overfitting, gives major improvements over other regularization methods, and yields state-of-the-art results on many benchmark data sets in vision, speech recognition, document classification and computational biology. (A short sketch of weight decay and dropout in Keras is given at the end of this section.)

Using an adaptive learning coefficient allows the learning step to be varied depending on the situation. To change the coefficient, a criterion is needed; the simplest and most obvious one is the value of the cost function. If the network starts to diverge, the step can be reduced to reach a more accurate minimum [7]; conversely, the step can be increased if, for example, we are sure that the network has fallen into a local minimum and should look for a solution further. Currently, adaptive optimization methods are widely used to configure neural networks. The following training algorithms were compared: Adadelta, Adagrad, Adagrad with low LR, Adam, Adamax, RMSprop and SGD (see Fig. 1). This result was obtained using ResNet50 with weights pre-trained on ImageNet; the input shape is (224, 224, 3). As can be seen, Adamax showed the best result, but this comparison was conducted on a large amount of data, and Nadam was not included because of memory problems with it.

Fig. 1: Comparison of adaptive learning coefficient optimization methods.

The pruning method consists in setting to zero the connections that are the least informative relative to the others. This method is more controversial, because it often worsens the result by nullifying important connections, but at the same time in large networks it can significantly simplify the calculations [8]. To implement pruning it is also necessary to use sparse matrices, i.e. matrices with many zeros.

A sparse matrix is a matrix in which most of the elements are zero. Sparse matrices by themselves, without pruning, can improve the results of an ANN [9]. On small networks they accelerate learning, and on large networks they allow the calculations to be performed faster [10] (see Fig. 2). Large sparse matrices are common in datasets that contain counts, in data encodings that map categories to counts, and even in whole subfields of natural language processing. It is computationally expensive to represent and work with sparse matrices as though they were dense, and a large improvement in performance can be achieved by using representations and operations that specifically handle the matrix sparsity.

Fig. 2: Comparison of sparse and dense matrices [10].
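As an illustration of how the regularizer (2), dropout and an adaptive optimizer can be combined, the following sketch assumes the Keras/TensorFlow stack used in Section 3; the layer sizes match the network described there, but the dropout rate of 0.2, the coefficient passed to the L2 regularizer (which folds λ/(2m) from (2) into a single constant) and the MSE loss are illustrative assumptions rather than values taken from the paper.

    import tensorflow as tf

    # L2 weight decay: Keras multiplies the given coefficient by the sum of
    # squared weights, so lambda/(2m) from (2) is folded into one constant here.
    l2 = tf.keras.regularizers.l2(1e-4)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=l2,
                              input_shape=(14,)),
        tf.keras.layers.Dropout(0.2),  # randomly drops 20% of units during training
        tf.keras.layers.Dense(32, activation="relu", kernel_regularizer=l2),
        tf.keras.layers.Dense(3),
    ])

    # Adamax adapts the learning step per weight, as discussed above.
    model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=1e-3),
                  loss="mse")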
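The following sketch illustrates the idea behind magnitude pruning combined with a sparse representation; it is not the pruning procedure used in the experiment, just a minimal NumPy/SciPy example in which the weight matrix and the threshold of 0.05 are arbitrary assumptions.

    import numpy as np
    from scipy.sparse import csr_matrix

    # A hypothetical weight matrix of a trained layer.
    weights = np.random.randn(64, 32)

    # Magnitude pruning: zero out the smallest (least informative) weights.
    threshold = 0.05
    pruned = np.where(np.abs(weights) < threshold, 0.0, weights)

    # Store the pruned weights in a compressed sparse row matrix, so that
    # only the non-zero entries (and their indices) are kept in memory.
    sparse_weights = csr_matrix(pruned)
    print(f"non-zero weights kept: {sparse_weights.nnz} of {weights.size}")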
3 Testing the methods

To check the described methods, tests were performed on the Building problem from the Proben1 set [11]. The set contains data for predicting the hourly consumption of electricity, hot water and cold water from the date, the time of day and weather data. Complete hourly data for four consecutive months are used for training, and the values for the following two months have to be predicted. In total there are 14 inputs and 3 outputs. The Keras framework on top of TensorFlow was selected for the implementation, and the ANN architecture consists of two dense layers, the first with 64 neurons and the second with 32 neurons.

The mean absolute percentage error (MAPE) (3) is used as the accuracy measure:

MAPE = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{A_t - F_t}{A_t}\right|    (3)

where A_t is the actual value, F_t is the forecast value, and n is the size of the value set.

To evaluate the quality of the forecast, the coefficient of determination R^2 (4) is used, which is considered a universal measure of the dependence of one random variable on many others. It takes values from 0 to 1; the closer the coefficient is to 1, the stronger the dependence. When evaluating regression models, it is interpreted as the agreement between the model and the data.

R^2 = 1 - \frac{SS_{res}}{SS_{tot}}    (4)

where SS_{res} (5) is the residual sum of squares and SS_{tot} (6) is the total sum of squares.

SS_{res} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2    (5)

where y_i is the actual value and \hat{y}_i is the predicted value.

SS_{tot} = \sum_{i=1}^{n}(y_i - \bar{y})^2    (6)

where y_i is the actual value and \bar{y} is the mean value.

After training and solving the problem on the test sample, the following results were obtained (Table 1).

Table 1: Results of solving the Building problem.

ANN type                Loss function   Accuracy (MAPE)   Determination coef.
Initial ANN             0.0542          0.9421            0.7006
Xavier initialization   0.044           0.9566            0.7501
Regularization          0.0632          0.9297            0.6953
Data shuffle            0.0319          0.9704            0.7871
Pruning                 0.1516          0.8761            0.5685
Sparse matrices         0.0337          0.9661            0.8348
Method composition      0.0138          0.9887            0.8843

In the benchmark [11] a result of 0.92 was obtained on the validation set, which is lower than all of our results except pruning. This can be explained by the fact that the benchmark task was solved with an older artificial neural network, so even the initial Keras ANN did better. In addition, their neural network needed about 400 epochs to learn, while our ANN learned in only 100 epochs.

For comparison, the forecasting problem was also solved with an ANN without using any of the methods for improving the quality of forecasting. Fig. 3a shows the learning curves of all the networks, and Fig. 3b their accuracy. The problem was solved with fairly good accuracy, but the determination coefficient was only 0.7.

Fig. 3: Comparison of the methods: (a) learning curves; (b) accuracy.

In the studied problem the matrices did not contain a large number of zeros, i.e. they were not sparse, so the sparse representation did not have a big impact on the result; at the same time, the calculations themselves became slightly faster, 159.546 seconds versus 157.486 seconds (the difference is small due to the small size of the problem). In this task pruning worsened the result, because there were not many near-zero weights and the task is generally small, but the pruned model can now be compressed without loss, since an array of zeros is easy to compress.

Applying the composition of all methods except pruning gave the best result: the forecasting accuracy increased by about 5.6%, while the determination coefficient reached 0.8843.
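For concreteness, a minimal sketch of the described architecture (14 inputs, two dense layers with 64 and 32 neurons, 3 outputs, trained for 100 epochs) is shown below. The hidden-layer activations, the MSE training loss, the validation split and the placeholder arrays x_train/y_train are assumptions for illustration only; the paper specifies the layer sizes and the number of epochs but not these details.

    import tensorflow as tf

    # Two dense layers (64 and 32 neurons) mapping the 14 Building inputs
    # to the 3 consumption targets, as described in Section 3.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(14,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(3),
    ])

    # Assumed setup: MSE as the training loss, with MAPE (3) tracked as a metric.
    model.compile(optimizer="adam", loss="mse",
                  metrics=[tf.keras.metrics.MeanAbsolutePercentageError()])

    # x_train and y_train stand in for the Proben1 Building data.
    # model.fit(x_train, y_train, epochs=100, validation_split=0.1)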
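The two quality measures (3)-(6) can be computed directly from the predictions; the sketch below uses NumPy, and the small arrays at the end are made-up values purely to illustrate the formulas.

    import numpy as np

    def mape(actual: np.ndarray, forecast: np.ndarray) -> float:
        """Mean absolute percentage error, Eq. (3)."""
        return float(np.mean(np.abs((actual - forecast) / actual)))

    def r2(actual: np.ndarray, forecast: np.ndarray) -> float:
        """Coefficient of determination, Eqs. (4)-(6)."""
        ss_res = np.sum((actual - forecast) ** 2)       # residual sum of squares (5)
        ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares (6)
        return float(1.0 - ss_res / ss_tot)

    y_true = np.array([1.0, 2.0, 3.0, 4.0])
    y_pred = np.array([1.1, 1.9, 3.2, 3.8])
    print(mape(y_true, y_pred), r2(y_true, y_pred))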
Thus, the study showed that although the standard methods used for tuning neural networks did not individually give a significant increase in forecasting accuracy, a combination of these methods allowed us to achieve a fairly high forecast accuracy (97.26%). However, when solving problems with an ANN, it is necessary to take the specifics of the task into account and, on that basis, decide whether to apply the more controversial methods such as pruning.

References

1. Bunnoon, P., Chalermyanont, K., Limsakul, C.: A Computing Model of Artificial Intelligent Approaches to Mid-term Load Forecasting: a state-of-the-art survey for the researcher. IACSIT International Journal of Engineering and Technology (2010)
2. Oliva, R., Watson, N.: Managing Functional Biases in Organizational Forecasts: A Case Study of Consensus Forecasting in Supply Chain Planning. Production and Operations Management Society, 138-151 (2009)
3. Osovsky, S.: Neural Networks for Information Processing. 2nd edn. Finance and Statistics, Moscow (2017)
4. LeCun, Y., Bottou, L., Orr, G., Muller, K.: Efficient BackProp. In: Neural Networks: Tricks of the Trade, Springer, 9-48 (1998)
5. How to Backtest Machine Learning Models for Time Series Forecasting, https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting. Last accessed 14 Oct 2019
6. Hinton, G., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Improving Neural Networks by Preventing Co-adaptation of Feature Detectors (2012)
7. Moreira, M., Fiesler, E.: Neural Networks with Adaptive Learning Rate and Momentum Terms. IDIAP Technical Report (1995)
8. Pruning deep neural networks to make them fast and small, https://jacobgil.github.io/deeplearning/pruning-deep-learning. Last accessed 19 Oct 2019
9. Sparse Matrices For Efficient Machine Learning, https://dziganto.github.io/Sparse-Matrices-For-Efficient-Machine-Learning/. Last accessed 23 Oct 2019
10. Changpinyo, S., Sandler, M., Zhmoginov, A.: The Power of Sparsity in Convolutional Neural Networks. ICLR 2017 (2017)
11. Prechelt, L.: Proben1: A Set of Neural Network Benchmark Problems and Benchmarking Rules. Technical Report 21, Karlsruhe, Germany (1994)