Integration of Big Data Processing Tools and Neural
                      Networks for Image Classification
               Nikita E. Kosykh                                          Anatoly D. Khomonenko
      Emperor Alexander I St. Petersburg                           Emperor Alexander I St. Petersburg State
         State Transport University,                                       Transport University,
          Saint Petersburg, Russia                                       Saint Petersburg, Russia
           nikitosagi@mail.ru                                              khomon@mail.ru

              Alexander P. Bochkov                                            Anatoly V. Kikot
          Peter the Great St. Petersburg                           Emperor Alexander I St. Petersburg State
             Polytechnic University,                                       Transport University,
            Saint Petersburg, Russia                                     Saint Petersburg, Russia
                 kostpea@mail.ru                                            a.v.kikot@yandex.ru


Abstract

The joint use of big data processing tools in solving problems of artificial
intelligence is becoming increasingly important. The article discusses the
task of optimizing the parameters of neural networks used for image
recognition using the Matlab and Hadoop systems together with the MNIST
(Modified National Institute of Standards and Technology) database, a
voluminous database of handwritten digit samples. The results of calculating
the optimal number of neural network layers for solving the image
classification problem are presented. We study the issues of evaluating the
accuracy of image classification depending on the number of network neurons,
choosing the optimal network training algorithm, and evaluating the effect of
parallelization [Sha09] using MATLAB Distributed Computing Server in the
process of training a neural network on computing performance.

Copyright © by the paper's authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
In: A. Khomonenko, B. Sokolov, K. Ivanova (eds.): Selected Papers of the
Models and Methods of Information Systems Research Workshop, St. Petersburg,
Russia, 4-5 Dec. 2019, published at http://ceur-ws.org.

1 Introduction
The joint use of big data processing tools [Bah15] in solving problems of
artificial intelligence is becoming increasingly important. Deep learning of
neural networks is a very popular topic these days, especially in computer
vision applications. Until 2010, there was no database [Liu16] sufficiently
complete and voluminous to train a neural network well enough to solve
certain image recognition problems. Existing solutions had a high error rate.
With the emergence of open databases, such as Kaggle (a big data bank) and
ImageNet (an image bank), it has become possible to train neural networks
that make practically error-free decisions. The article discusses the task of
optimizing the parameters of neural networks used for image recognition using
the Matlab and Hadoop systems.
    Justification of the choice of programming environment. Matlab 2016a
[Kra18] was chosen as an integrated programming environment for working with
big data. It supports several tools for solving such problems, namely:
    • Variables stored on the hard disk. Using the matfile function, you can
      access MATLAB variables directly from the MAT file on disk, without
      loading the entire variable into memory. This allows processing of
      large data sets that do not fit in memory.
    • Datastore. The datastore function allows you to access data that does
      not fit in memory. This may include data from files, sets or tables.
    • Parallel computing. Parallel Computing Toolbox implements parallel
      loops to run MATLAB code on multi-core architectures.
    • Machine learning. Machine learning is used to develop predictive models
      for big data. The whole range of machine learning algorithms is
      contained in the Statistics and Machine Learning Toolbox and Neural
      Network [Pao89] Toolbox modules.
    • Hadoop. Using the MapReduce [Gha15] and datastore functionality built
      into MATLAB, you can develop algorithms on a personal computer and run
      them on Hadoop. You can request a piece of data using the datastore
      function and then, using the MATLAB Distributed Computing Server, run
      the algorithms within the Hadoop MapReduce environment on the complete
      set of data.

2 MNIST Dataset of Digit Patterns
The MNIST (Modified National Institute of Standards and Technology) database
is a voluminous database of handwritten digit samples. The database is a
standard proposed by the US National Institute of Standards and Technology
for the purpose of calibrating and comparing image recognition methods using
machine learning, primarily based on neural networks.
    Neural networks tend to learn better from specific examples. In the
development of a neural network [Nov15] for pattern recognition, we use the
popular MNIST handwritten data set (Hadoop, 2016). Kaggle's open portal for
distributing big data [Peh19] uses this very set in the Digit Recognizer
training contest. The set contains the following components:
         a) trainSet.csv – training data;
         b) testSet.csv – test data for presentation.
    First you need to load the training data into MATLAB (MATLAB, 2018). For
this we use the built-in function csvread.

% General form: M = csvread(filename, R1, C1)
% R1 = 1, C1 = 0 start reading at the second row, skipping the header row
trainSet = csvread('trainSet.csv', 1, 0);
testSet = csvread('testSet.csv', 1, 0);

    The first column in trainSet is a label that shows the correct digit for
each sample in the data set, and each row is a sample. The remaining columns
of a row hold a 28x28 image of the handwritten digit, but all the pixels are
placed in one row and not in the original rectangular shape. To render the
digits, we need to rebuild the rows into 28x28 matrices. To do this, you can
use the reshape function, with the caveat that you need to transpose the
result, because reshape fills the matrix column by column and not row by row.
    To display an image, you need to create a graphic context (window) using
the function figure. Properties of the graphic context can be passed to the
function as parameters. Initially the handwritten digits are displayed in a
color palette of green and blue shades (Fig. 1).

                   Figure 1: Display of the default set

    For better perception, we use the colormap() function, which sets the
color map of the final image.
    colormap(gray) sets a linear palette in shades of gray.
    The following code fragment reads the first 16 rows of the data set,
which are 16 digits, converts the rows into matrices and displays the
resulting images on the screen. The titles above the images show the class
labels of the objects.

figure
colormap(gray)
for i = 1:16
    subplot(4,4,i)
    % each row holds one flattened 28x28 image; transpose after reshape
    matrix = reshape(trainSet(i, 2:end), [28,28])';
    imagesc(matrix)
    title(num2str(trainSet(i, 1)))   % class label from the first column
end

    After the above code fragment is executed, an image of 16 handwritten
digits appears on the screen, placed in a single figure window in a black and
white palette (Fig. 2).

                 Figure 2: Image of handwritten digits
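    The role of the transpose after reshape can be checked on a small
example. The sketch below is illustrative only: the 2x3 "image" and the
variable names are assumptions, not part of the original experiment. It
shows that reshape fills the result column by column, so a picture stored
row by row is recovered by reshaping to [columns, rows] and then transposing.

```matlab
% A hypothetical 1x6 flattened image whose pixels were stored row by row
% from the 2x3 picture [1 2 3; 4 5 6].
row = [1 2 3 4 5 6];

% reshape fills the output column by column, so reshaping straight to
% 2x3 scrambles the pixel order.
wrong = reshape(row, [2, 3]);

% Reshaping to [nCols, nRows] and transposing restores the row-wise layout.
right = reshape(row, [3, 2])';

disp(wrong)
disp(right)
```

    For a square 28x28 image the two size vectors coincide, which is why
only the trailing transpose is visible in the MNIST fragment.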
    To build the network [Kho15], we will use the pattern recognition tool
'nprtool' from the Neural Network Toolbox module.

3 Data preparation
The tool expects two sets of data as input:
   a) input – a numeric matrix, each column of which represents one sample.
      The samples are scanned images of handwritten digits;
   b) vectorLabels – a binary matrix of 0s and 1s in which the position of
      the 1 is mapped to the specific label of the image. Such an encoding
      is also called a dummy variable. Neural Network Toolbox also expects
      labels to be stored in columns and not in rows.
    Class labels range from 0 to 9, as there are exactly that many
single-digit numbers. Because MATLAB indexing starts from 1, the label 0
cannot serve as a class index, so we use “10” instead of “0”.
    The trainSet dataset stores sample images in rows, not columns, so the
entire dataset needs to be transposed. A labeled test set is not provided in
advance with the data. To form one, we hold out 1/3 of the training data.

n = size(trainSet, 1);
labels = trainSet(:,1);
labels(labels == 0) = 10;      % class "0" is encoded as 10
labelsd = dummyvar(labels);    % one binary column per class
inputs = trainSet(:,2:end);
inputs = inputs';              % transposition of the set of predictors
labels = labels';
labelsd = labelsd';
rng(1);                        % fix the random seed for reproducibility
c = cvpartition(n, 'Holdout', n / 3);   % n / 3 is an integer count here
XtrainSet = inputs(:, training(c));
YtrainSet = labelsd(:, training(c));
XtestSet = inputs(:, test(c));
YtestSet = labels(test(c));
YtestSetd = labelsd(:, test(c));

    Explanation: the function c = cvpartition(n, 'Holdout', p) creates a
random partition of n observations for holdout validation. The partition
divides the observations into a training set and a test set. Parameter p
must be a scalar. When 0 < p < 1, cvpartition randomly selects approximately
p*n observations for the test set; when p is an integer, it selects p
observations. The default is p = 1/10.

4 Using the Neural Network Toolbox
The following describes the steps to work with the Matlab pattern
recognition tool.
   A. To start working with the tool, call nprtool from the command window.
   B. In the next menu, select Pattern Recognition Tool to open the pattern
recognition tool.
   C. On the welcome screen, go to “Select data”.
   D. For the input, choose XtrainSet and for classes, YtrainSet.
   E. After selecting the arrays, go to the “Validation and Test Data”
section. In this case, you can leave the default values. This will divide the
data in the ratio of 70-15-15 into sets for training, validation and testing.
   F. In the network architecture section, change the number of hidden
neurons to 100 and go on.
   G. For training, go to the Train Network section and click “Train” to
start training. Upon completion of the operation, a window with training
results will open. Numerical indicators can be viewed as graphs and charts.
   H. On the last tab, you will be prompted to save the training script for
the model you just created (e.g. NeuralNetworkScript.m).
    Below is a diagram of a model of an artificial neural network (Figure 3),
which was created using the pattern recognition tool. It has 784 input
neurons, 100 hidden layer neurons and 10 output layer neurons (which is equal
to the number of classes for prediction).

           Figure 3: Neural network model with parameters

    The model is trained by adjusting the weights to produce correct results.
W in the diagram denotes the weights and b the bias, which are part of
individual neurons. Each neuron in the hidden layer is arranged as follows:
784 inputs and correspondingly the same number of weights, 1 bias unit and
10 activation outputs.

5 Data Visualization
To begin, look inside the NeuralNetworkScript.m file. There you can see the
generated variables, such as IW1_1 and x1_step1_keep, which are the weight
vectors obtained in the process of training the network. Since we have 784
inputs and 100 neurons, the complete hidden layer consists of a 100x784
matrix. Now you can see clearly what the neurons learn. Copy the above
variables from the file and enter them into the workspace in the form of
matrices.

W1 = zeros(100, 28 * 28);
W1(:, x1_step1_keep) = IW1_1;

figure
colormap(gray)
for i = 1:16
    subplot(4,4,i)
    digit = reshape(W1(i, :), [28,28])';
    imagesc(digit)
end

6 Evaluation Of Classification Accuracy
Now the previously generated function can be used to predict the classes for
the held-out XtestSet data and compare them with the actual classes in
YtestSet. This gives a realistic picture of the performance on data the
network has not seen. The network returns a 10x14000 matrix of class
probabilities; we convert it to a 1x14000 vector by selecting, for each
column, only the class with the maximum probability.

% predict the probability of each label
YpredSet = neuralNetworkFunction(XtestSet);
% display the first 5 columns
YpredSet(:, 1:5)
% keep the index of the most probable class in each column
[~, YpredSet] = max(YpredSet);

    It remains to compare the class indices predicted by the network with
the actual ones. To do this, we apply the formula for assessing the quality
of the algorithm, namely, the accuracy of data classification:

$$\text{Accuracy} = \frac{P}{N},$$

where N is the size of the sample being evaluated (here the held-out test
set) and P is the number of objects from the sample for which the network
made the right decision.
    This formula must be translated into Matlab code for our data sets,
namely:

% compare predicted and actual labels
Accuracy = sum(YtestSet == YpredSet) / length(YtestSet)
% Accuracy = 0.9554

    The prediction accuracy is 95.5%, which is good enough for a network
with a minimum number of manual settings.

7 Calculation The Optimal Number Of Layers
It now remains to find out how a change in the number of hidden neurons
affects the accuracy of data classification. The main thing is to fulfill
the inequality

$$\text{inputs} \ge \text{hiddens} \ge \text{outputs}$$

In the previous experiment, 100 neurons of the hidden layer were involved.
We will try to find the optimal value at which the accuracy improves
without overfitting the system.
    As input parameters, we set the number of neurons in the hidden layer.
The amount will vary from 10 to 300 in increments of 25. All values are
written in the layers array. Then, in the loop, a network is trained for each
value, and the resulting accuracy estimates are written into the
AccuracyScores variable. Let us save the set of commands to the nnscript.mlx
script, so that we can repeat the runs.

layers = [15,50:30:290];
AccuracyScores = zeros(length(layers), 1);
models = cell(length(layers), 1);
for i = 1:length(layers)
    hiddenLayerSize = layers(i);
    nn = patternnet(hiddenLayerSize);
    nn.divideParam.trainRatio = 0.7;
    nn.divideParam.valRatio = 0.15;
    nn.divideParam.testRatio = 0.15;
    nn = train(nn, XtrainSet, YtrainSet);
    p = nn(XtestSet);
    models{i} = nn;
    [~, p] = max(p);           % most probable class per sample
    AccuracyScores(i) = sum(YtestSet == p) / length(YtestSet);
end

figure
plot(layers, AccuracyScores, 'o-')
xlabel('Number of neurons')
ylabel('Accuracy of classification')
title('Number of neurons / accuracy')

    As a result of executing this sequence of commands, we obtain a graph of
the dependence of the accuracy of the algorithm on the number of neurons in
the hidden layer (Fig. 4).

   Figure 4: The dependence of the classifier accuracy on the number of
   neurons (x-axis: number of neurons; y-axis: accuracy of classification)

    Analyzing the graph, we can conclude that the best result is obtained at
about 145 neurons, with an accuracy of 0.957. With a further increase in the
number of neurons in the hidden layer, the accuracy remains the same, and
performance begins to decline markedly.
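    One way to turn such a sweep into a concrete choice is to take the
smallest hidden layer whose accuracy lies within a small tolerance of the
best score. The sketch below is only an illustration of that idea: the
accuracy values are made up for the example (they are not the measurements
behind Fig. 4), and the names accuracyScores, tolerance and chosenSize are
assumptions.

```matlab
% Hypothetical sweep results (illustrative values, not measured data).
layers         = [15 50 80 110 140 170 200];
accuracyScores = [0.912 0.941 0.950 0.955 0.957 0.957 0.956];

% Highest accuracy reached anywhere in the sweep.
bestAccuracy = max(accuracyScores);

% Smallest layer size whose accuracy is within the tolerance of the best.
% The tolerance is an assumed value chosen for this example.
tolerance = 0.0015;
idx = find(accuracyScores >= bestAccuracy - tolerance, 1, 'first');
chosenSize = layers(idx);

fprintf('Chosen hidden layer size: %d (accuracy %.3f)\n', ...
        chosenSize, accuracyScores(idx));
```

    Preferring the smallest adequate size keeps training time down, in line
with the observation that larger hidden layers stop improving accuracy while
performance declines.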
    Note that we obtain greater accuracy with an increase in the number of
neurons, but at some point the accuracy may fluctuate downward (due to the
random initialization of the weights). As the number of neurons increases,
the model can capture more features, but with too many of them the model can
eventually overfit one data set, and this has a bad effect on the
classification of new data.
    Consider the question of justifying the choice of the optimal learning
algorithm.

8 Selection of the optimal learning algorithm
The algorithms chosen for network training are presented in Table 1. The
preliminary selection was made on the basis of the performance studies of
algorithms for solving various typical problems by the MathWorks group.

         Table 1: Neural Network Learning Algorithms

Matlab function    Algorithm
'trainscg'         Scaled conjugate gradient (SCG)
'trainrp'          Resilient backpropagation (Rprop)
'traincgf'         Fletcher-Reeves conjugate gradient method
'traincgb'         Powell-Beale conjugate gradient method
'traincgp'         Polak-Ribiere conjugate gradient method
'trainoss'         One-step secant method

funcArray = {'trainscg','trainrp','traincgb','traincgp','traincgf','trainoss'}';

    All computational processes were performed on a single personal computer
without the use of distributed computing, i.e. without additional
optimization.

model = cell(length(funcArray), 1);
accuracyRate = cell(length(funcArray), 3);   % name, accuracy, time
for i = 1:length(funcArray)
    net = patternnet(145, funcArray{i,1});
    net.divideParam.trainRatio = 0.7;
    net.divideParam.valRatio = 0.15;
    net.divideParam.testRatio = 0.15;
    rng(1);
    tic                          % timer start
    net = train(net, XtrainSet, YtrainSet);
    timer1 = toc;                % timer stop, elapsed seconds
    model{i} = net;
    p = net(XtestSet);
    [~, p] = max(p);
    accuracyRate{i,1} = funcArray{i};
    accuracyRate{i,2} = sum(YtestSet == p) / length(YtestSet);
    accuracyRate{i,3} = timer1;
end

    First, a cell array is created to store the future models of trained
networks. Each model contains a network trained with a specific algorithm.
Next, we create the variable that stores the name of the algorithm, its
accuracy estimate, and the network training time.
    The code in the loop performs network training using a constant number
of hidden neurons. Data for training is divided in the classical proportion
70/15/15. On each iteration the results are written into the accuracyRate
cell array.
    To get the results sorted by time, convert the cell array to a table
(sortrows does not accept a cell array that mixes text and numbers) and
execute the following command:

sortrows(cell2table(accuracyRate, ...
    'VariableNames', {'Algorithm','Accuracy','Time'}), 'Time')

    The test results of learning algorithms with no parallel computing [39]
are presented in Table 2.

    Table 2: Algorithm Test Results for Conventional Computing

Learning Algorithm    Classification Accuracy    Learning Time, sec.
'trainrp'             0.9038                     71.0844
'trainscg'            0.9566                     149.0134
'traincgp'            0.9558                     249.0219
'traincgb'            0.9589                     313.0135
'traincgf'            0.9612                     409.3543
'trainoss'            0.9588                     845.4352

    The first column contains the names of the algorithms, the second the
accuracy of data classification, and the third the network training time in
seconds. As we can see, Rprop turned out to be the fastest method for
learning, but its accuracy is poor. The optimal approach, in terms of
execution time and accuracy, is the scaled conjugate gradient algorithm
(trainscg) with a result of 95.6%.

9 Distributed Learning Computing
Parallel Computing Toolbox allows training and building a neural network
using multiple processor cores on a single PC or on multiple network
computers using the MATLAB Distributed Computing Server.
    Using multiple cores can speed up calculations. Using multiple computers
can solve the problem of a lack of RAM when a data set is too large for one
computer.
    The goal is to use the tool and identify how the number of cores
involved affects the network training time for the above algorithms.
    To manage cluster configurations, the Cluster Profile Manager is used.
    To open a pool of MATLAB workers using the default cluster profile,
which refers to the local CPU cores, use the following command:

pool = parpool
    You can also check the number of workers involved (in our case, the
cores) by querying the pool property:

pool.NumWorkers
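    If the pool should have a specific size rather than the profile default,
the number of workers can be passed to parpool directly. The fragment below
is a minimal sketch that assumes Parallel Computing Toolbox is installed; the
worker count of 2 is an assumed value matching a dual-core machine and is not
taken from the original experiment.

```matlab
% Reuse a pool that is already open; otherwise start a new one.
pool = gcp('nocreate');
if isempty(pool)
    % Two workers is an assumed value for a dual-core machine.
    pool = parpool(2);
end

% Report how many workers the pool actually has.
fprintf('Workers in the pool: %d\n', pool.NumWorkers);
```

    Calling delete(pool) closes the pool when the experiments are finished.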
    Now we can train the neural network by sharing data among the CPU cores.
To do this, set the parameters for the training and testing network
functions.

net = train(net, XtrainSet, YtrainSet, 'useParallel', 'yes');
p = net(XtestSet, 'useParallel', 'yes');
    By passing the 'showResources' argument, you can verify that the
calculations are performed on several cores.

net = train(net, XtrainSet, YtrainSet, 'useParallel', 'yes', ...
    'showResources', 'yes');
p = net(XtestSet, 'useParallel', 'yes', 'showResources', 'yes');

    MATLAB indicates which resources were used.
    When the training and testing methods of the network are called, they
divide the input data into distributed composite values. After performing the
operations, they transform the data back into the original representation in
the form of a matrix or a cell array.
    Here are the results of comparing the performance of the same algorithms
for training a neural network, but using two physical cores to parallelize
operations (Table 3).

       Table 3: Using Parallel Computing for Network Learning

Algorithm     Accuracy    Time, sec.
'trainrp'     0.9038      87.6
'trainscg'    0.9566      187.3
'traincgp'    0.9558      268.9
'traincgb'    0.9589      330.9
'traincgf'    0.9612      371.6
'trainoss'    0.9588      638.5

    Finally, we build a Matlab bar chart comparing the network training time
using one core and two processor cores of a personal computer (Fig. 5).

        Figure 5: Comparison of 2CPU and CPU performance

    Note that the effect of using distributed computing becomes more
noticeable for algorithms that need more iterations to converge to a result.
    A multilayer neural network has been built to solve a machine learning
problem, namely, optical recognition of handwritten characters based on the
MNIST training data set.
    In a number of experiments, optimal parameters (the number of hidden
neurons, the learning algorithm) were determined to achieve good accuracy of
the data classification algorithm and a good network training time.
    The trainrp function is the fastest pattern recognition algorithm.
However, its performance deteriorates as the error value decreases. The
memory requirements for this algorithm are relatively small compared to
others.
    In particular, trainscg, the scaled conjugate gradient algorithm, seems
to have done a good job with a large number of weights. SCG works almost as
fast with pattern recognition as Rprop, but its performance does not decrease
when the error is reduced.
    Other algorithms become very slow with an increase in the number of
neurons in the network; however, they can be useful in situations where a
slower convergence of the function is required.
    When using tools for parallel [Sha09] distributed computing, there is an
increase in the learning speed for complex algorithms based on conjugate
gradients. Observations must be carried out with a minimum software load on
the disk array; otherwise the results will vary greatly between idle and peak
loads.
    The scaled conjugate gradient algorithm (trainscg) is optimal for the
task of pattern recognition based on neural networks with a moderate number
of input neurons. Unlike other algorithms, its performance does not decrease
with a decrease in error.
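    The timings reported in Table 2 (one core) and Table 3 (two cores) can
be compared directly by computing a speedup ratio per algorithm and drawing
them as a grouped bar chart. The fragment below is a sketch of such a
comparison; the numbers are copied from the two tables, while the variable
names and the chart styling are assumptions.

```matlab
% Training times in seconds, copied from Table 2 and Table 3.
algorithms   = {'trainrp','trainscg','traincgp','traincgb','traincgf','trainoss'};
timeOneCore  = [71.0844 149.0134 249.0219 313.0135 409.3543 845.4352];
timeTwoCores = [87.6 187.3 268.9 330.9 371.6 638.5];

% A speedup above 1 means the two-core run finished sooner.
speedup = timeOneCore ./ timeTwoCores;
for k = 1:numel(algorithms)
    fprintf('%-9s speedup = %.2f\n', algorithms{k}, speedup(k));
end

% Grouped bar chart: one pair of bars per algorithm.
figure
bar([timeOneCore; timeTwoCores]')
set(gca, 'XTickLabel', algorithms)
ylabel('Training time, sec.')
legend('1 core', '2 cores', 'Location', 'northwest')
```

    With these figures only traincgf and trainoss finish sooner on two
cores, while the faster algorithms lose time to parallelization overhead.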
    When tools for parallel distributed computing are used, the
learning speed increases for the modified algorithms based on
conjugate gradients.
    At the preparatory stage, before the neural network is built, the
Apache Hadoop framework is deployed to store the data in a
pseudo-cluster and to provide access to the data through the Java
interface.
    A general advantage of the approach is its versatility and its
ability to integrate with the existing infrastructure of an
enterprise, including its cloud storage and services. Because many
libraries for working with neural networks and big data already
exist, the idea can be implemented in any high-level programming
language, given a programmer with the appropriate qualifications.

10 Conclusion

The technologies for sharing Matlab and Hadoop tools discussed above
can be applied to optimize the use of neural networks in solving
various applied problems of artificial intelligence.

Acknowledgments

The work was partially supported by the grant of the MES RK: project
No. AP05133699 "Research and development of innovative information
and telecommunication technologies using modern cyber technical means
for the city's intelligent transport system".

References

[Bah15] Bahrami M., Singhal M. The role of cloud computing
         architecture in big data. Information granularity, big data,
         and computational intelligence. Springer, Cham. Pp. 275–295.
[Peh19] Pehcevski J. Big data analytics: methods and applications.
         Arcler Press. Canada. 2019. 430 p.
[Gha15] Ghazi M. R., Gangodkar D. Hadoop, MapReduce and HDFS: a
         developers perspective. Procedia Computer Science.
         Pp. 45–50.
[Kra18] Krasnovidov A. V., Khomonenko A. D., Zabrodin A. V.,
         Smirnov A. V. On the peculiarities of the exchange of data
         between applications in high-level languages and Matlab
         functions. CEUR Workshop Proceedings. Workshop Computer
         Science and Engineering in the framework of the 5th
         International Scientific-Methodical Conference "Problems of
         Mathematical and Natural-Scientific Training in Engineering
         Education". St. Petersburg, Russia, November 8-9, 2018.
         Vol. 2341. Pp. 33–41.
[Liu16] Liu J. Rethinking big data: A review on the data quality and
         usage issues. ISPRS Journal of Photogrammetry and Remote
         Sensing. 2016. Vol. 115. Pp. 134–142.
[Nov15] Novikov P. A., Khomonenko A. D., Yakovlev E. L.
         Justification of the choice of neural networks learning
         algorithms for indoor mobile positioning. CEE-SECR '15:
         Proceedings of the 11th Central & Eastern European Software
         Engineering Conference in Russia. Moscow, Russian
         Federation. October 22-24, 2015. ACM, New York, NY, USA,
         2015. Article No. 9.
[Sha09] Sharma G., Martin J. MATLAB®: A language for parallel
         computing. International Journal of Parallel Programming.
         2009. Vol. 37. No. 1. Pp. 3–36.
[Pao89] Pao Y. Adaptive pattern recognition and neural networks.
         Reading, MA: Addison-Wesley, 1989. 309 p.
[Kho15] Khomonenko A. D., Yakovlev E. L. Neural network
         approximation of characteristics of multi-channel
         non-Markovian queuing systems. SPIIRAS Proceedings. 2015.
         Issue 4(41). Pp. 81–93.

