Integration of Big Data Processing Tools and Neural Networks for Image Classification

Nikita E. Kosykh
Emperor Alexander I St. Petersburg State Transport University, Saint Petersburg, Russia
nikitosagi@mail.ru

Anatoly D. Khomonenko
Emperor Alexander I St. Petersburg State Transport University, Saint Petersburg, Russia
khomon@mail.ru

Alexander P. Bochkov
Peter the Great St. Petersburg Polytechnic University, Saint Petersburg, Russia
kostpea@mail.ru

Anatoly V. Kikot
Emperor Alexander I St. Petersburg State Transport University, Saint Petersburg, Russia
a.v.kikot@yandex.ru

Abstract

The joint use of big data processing tools in solving artificial intelligence problems is becoming increasingly important. The article discusses the task of optimizing the parameters of neural networks used for image recognition with the Matlab and Hadoop systems and the MNIST (Modified National Institute of Standards and Technology) database, a voluminous database of handwritten digit samples. The results of calculating the optimal number of neural network layers for solving the image classification problem are presented. We study the evaluation of image classification accuracy depending on the number of network neurons, the choice of the optimal network training algorithm, and the effect on computing performance of parallelization[Sha09] with MATLAB Distributed Computing Server during neural network training.

1 Introduction
The joint use of big data[Bah15] processing tools in solving artificial intelligence problems is becoming increasingly important. Deep learning of neural networks is a very popular topic these days, especially in computer vision applications. Until 2010, there was no database[Liu16] sufficiently complete and voluminous to train a neural network well enough to solve certain image recognition problems. Existing solutions had a high level of inaccuracy. With the emergence of open databases such as Kaggle (big data bank) and ImageNet (image bank), it has become possible to train neural networks and make practically error-free decisions. The article discusses the task of optimizing the parameters of neural networks used for image recognition using the Matlab and Hadoop systems.
Justification of the choice of programming environment. Matlab 2016a[Kra18] was chosen as the integrated programming environment for working with big data. It supports several tools for this purpose, namely:
    • Variables stored on the hard disk. Using the matfile function, you can access MATLAB variables directly from a MAT file on disk, without loading the entire variable into memory. This allows processing of large data sets that do not fit in memory.
    • Datastore. The datastore function allows you to access data that does not fit in memory. This may include data from files, sets of files or tables.
    • Parallel computing. Parallel Computing Toolbox implements parallel loops to run MATLAB code on multi-core architectures.
    • Machine learning. Machine learning used to
    • Hadoop. Using the MapReduce[Gha15] and DataStore functionality built into MATLAB, you can develop algorithms on a personal computer and run them on Hadoop. You can request a piece of data using the datastore function and then, using the Distributed Computing Server, run the algorithms within the Hadoop MapReduce environment on the complete set of data.

2 MNIST Dataset of Digit Patterns
The MNIST (Modified National Institute of Standards and Technology) database is a voluminous database of handwritten digit samples. The database is a standard proposed by the US National Institute of Standards and Technology for the purpose of calibrating and comparing image recognition methods using machine learning, primarily based on neural networks.
    Neural networks tend to learn better from specific examples. In the development of a neural[Nov15] network for pattern recognition, we use the popular MNIST handwritten data set (Hadoop, 2016). Kaggle's open portal for distributing big data [Peh19] uses this very set in the Digit Recognizer training contest. The set contains the following components:
         a) trainSet.csv – training data;
         b) testSet.csv – test data.
    First you need to load the training data into MATLAB (MATLAB, 2018). For this we use the built-in function csvread.

M = csvread(filename, R1, C1)   % general syntax: start reading at zero-based row R1, column C1

trainSet = csvread('trainSet.csv', 1, 0);   % skip the header row
subSet = csvread('testSet.csv', 1, 0);

    The first column in the trainSet set is a label that shows the correct digit for each sample in the data set, and each row is a sample. The remaining columns of a row hold the 28x28 image of the handwritten digit, but with all the pixels placed in one row rather than in the original rectangular shape. To render the digits, we need to rebuild the rows into 28x28 matrices. To do this, you can use the reshape function, with the caveat that you need to transpose the result, because reshape fills the matrix column by column and not row by row.
    To display an image, you need to create a graphic context (window) using the figure function. Properties of the graphic context can be passed to the function as parameters. Initially, the handwritten digits are displayed in the default color palette of green and blue shades (Fig. 1).

Figure 1: Display the default set

    For better perception, we use the colormap() function, which sets the color map of the final image. colormap(gray) sets a linear palette in shades of gray.
    The following code fragment reads the first 16 rows of the data set, which are 16 digits. It converts the rows into matrices and displays the resulting images on the screen, while the numbers on the top show the object class numbers.

figure                                    % open a new figure window
colormap(gray)                            % linear palette in shades of gray
for i = 1:16                              % first 16 samples
    subplot(4,4,i)
    matrix = reshape(trainSet(i, 2:end), [28,28])';   % row -> 28x28 matrix, transposed
    imagesc(matrix)
    title(num2str(trainSet(i, 1)))        % class label above each image
end

    After the above code fragment is executed, 16 handwritten digits appear on the screen, placed in a single figure window in a black and white palette (Fig. 2).

Figure 2: Images of handwritten digits
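    The column-by-column behaviour of reshape mentioned in this section can be checked on a small example. The sketch below is illustrative only: the vector flat and its values are hypothetical, not part of the MNIST set, and stand for a 2x3 "image" stored row by row in a single row.

```matlab
% Illustrative values only: a 2x3 "image" flattened row by row,
% in the same way the CSV file stores each 28x28 digit in one row.
flat = [1 2 3 4 5 6];

% reshape fills the result column by column, so reshaping directly
% to the target size scrambles the original rows.
wrong = reshape(flat, [2, 3]);     % gives [1 3 5; 2 4 6]

% Reshape to the transposed size, then transpose to restore the rows.
right = reshape(flat, [3, 2])';    % gives [1 2 3; 4 5 6]

disp(wrong)
disp(right)
```

    For a square 28x28 image the transposed size equals the original size, which is why a single reshape to [28,28] followed by a transpose is sufficient for the digits.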
    To build the network[Kho15], we will use the pattern recognition tool ‘nprtool’ from the Neural Network Toolbox module.

3 Data preparation
The tool expects two sets of data as input:
   a) input – a numeric matrix in which each column represents a sample and each row a feature. These are the scanned images of handwritten digits;
   b) vectorLabels – a binary matrix of 0s and 1s that maps each image to the label it represents. It is also called a dummy variable. Neural Network Toolbox also expects the labels to be stored in columns and not in rows.
    Class labels range from 0 to 9, one for each digit. Because MATLAB indexing starts from 1, the label 0 cannot be used as an index, so we use “10” instead of “0”.
    The trainSet dataset stores sample images in rows, not columns, so the entire dataset needs to be transposed. The MNIST set does not provide a labelled test set in advance. To form it, we hold out 1/3 of the data, which splits the set into a training part and a test part.

n = size(trainSet, 1);                    % number of samples
labels = trainSet(:,1);                   % class labels, 0..9
labels(labels == 0) = 10;                 % use 10 instead of 0
labelsd = dummyvar(labels);               % binary (dummy variable) form of the labels
inputs = trainSet(:,2:end);               % pixel values
inputs = inputs';                         % transposition of the set of predictors
labels = labels';
labelsd = labelsd';
rng(1);                                   % fix the seed for reproducibility
c = cvpartition(n, 'Holdout', floor(n / 3));   % hold out 1/3 of the samples (integer count)
XtrainSet = inputs(:, training(c));
YtrainSet = labelsd(:, training(c));
XtestSet = inputs(:, test(c));
YtestSet = labels(test(c));
YtestSetd = labelsd(:, test(c));

     Explanation: The function c = cvpartition(n, 'Holdout', p) randomly creates a partition for holdout validation on n observations. This partition divides the observations into a training set and a test set. Parameter p must be a scalar. When 0

    D. For the inputs, choose XtrainSet, and for the classes, YtrainSet.
    E. After selecting the arrays, go to the “Data for verification” section. In this case, you can leave the default values. This will divide the data in the ratio of 70-15-15 into sets for training, validation and testing.
    F. In the network architecture section, change the number of hidden layer neurons to 100 and continue.
    G. For training, go to the Train Network section and click “Train” to start training. Upon completion of the operation, a window with the learning results will open. Numerical indicators can be viewed as graphs and charts.
    H. On the last tab, you will be prompted to save the learning script for the model you just created (e.g. NeuralNetworkSctipt.m).
    Below is a diagram of the artificial neural network model (Fig. 3), which was created using the pattern recognition tool. It has 784 input neurons, 100 hidden layer neurons and 10 output layer neurons (which is equal to the number of classes for prediction).

Figure 3: Neural network model with parameters

    The model is trained by adjusting the weights to produce correct results. W in the diagram denotes the weights and b the bias (offset) units, which are part of the individual neurons. Each individual neuron in the hidden layer is as follows: 784 inputs with the corresponding number of weights, 1 bias unit and 10 activation outputs.

5 Data Visualization
To begin, look inside the NeuralNetworkSctipt.m file. There you can see the generated variables, such as IW1_1 and x1_step1_keep, which are weight vectors obtained in the process of training the