Integration of Big Data Processing Tools and Neural Networks for Image Classification

Nikita E. Kosykh
Emperor Alexander I St. Petersburg State Transport University, Saint Petersburg, Russia
nikitosagi@mail.ru

Anatoly D. Khomonenko
Emperor Alexander I St. Petersburg State Transport University, Saint Petersburg, Russia
khomon@mail.ru

Alexander P. Bochkov
Peter the Great St. Petersburg Polytechnic University, Saint Petersburg, Russia
kostpea@mail.ru

Anatoly V. Kikot
Emperor Alexander I St. Petersburg State Transport University, Saint Petersburg, Russia
a.v.kikot@yandex.ru

Copyright © by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: A. Khomonenko, B. Sokolov, K. Ivanova (eds.): Selected Papers of the Models and Methods of Information Systems Research Workshop, St. Petersburg, Russia, 4-5 Dec. 2019, published at http://ceur-ws.org.

Abstract

The joint use of big data processing tools in solving artificial intelligence problems is becoming increasingly important. The article discusses the task of optimizing the parameters of neural networks used for image recognition with the Matlab and Hadoop systems and the MNIST (Modified National Institute of Standards and Technology) database, a large database of handwritten digit samples. The results of calculating the optimal number of neural network layers for solving the image classification problem are presented. We study the following issues: evaluating the accuracy of image classification depending on the number of network neurons, choosing the optimal network training algorithm, and evaluating the effect of parallelization[Sha09] with MATLAB Distributed Computing Server on computing performance during neural network training.

1 Introduction

The joint use of big data processing tools[Bah15] in solving artificial intelligence problems is becoming increasingly important. Deep learning of neural networks is a very popular topic these days, especially in computer vision applications. Until 2010, there was no database[Liu16] sufficiently complete and voluminous to train a neural network well enough to solve certain image recognition problems. Existing solutions had a high level of inaccuracy. With the emergence of open databases such as Kaggle (a big data bank) and ImageNet (an image bank), it has become possible to train neural networks that make practically error-free decisions. The article discusses the task of optimizing the parameters of neural networks used for image recognition with the Matlab and Hadoop systems.

Justification of the choice of programming environment. Matlab 2016a[Kra18] was chosen as the integrated programming environment for working with big data. It supports several tools for solving such problems, namely:

• Variables stored on the hard disk. Using the matfile function, you can access MATLAB variables directly from a MAT-file on disk without loading the entire variable into memory. This allows processing of large data sets that do not fit in memory.

• Datastore. The datastore function gives access to data that does not fit in memory. This may include data from files, sets or tables.

• Parallel computing. Parallel Computing Toolbox implements parallel loops to run MATLAB code on multi-core architectures.

• Machine learning. Machine learning is used to develop predictive models for big data. The full range of machine learning algorithms is contained in the Statistics Toolbox and Neural Network[Pao89] Toolbox modules.

• Hadoop. Using the MapReduce[Gha15] and datastore functionality built into MATLAB, you can develop algorithms on a personal computer and run them on Hadoop. You can request a piece of data using the datastore function and then, using the Distributed Computing Server, run the algorithms within the Hadoop MapReduce environment on the complete data set.
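The first tool in the list, partial access to a variable stored on disk, can be illustrated with a minimal sketch. It assumes that the data were saved in the version 7.3 MAT-file format, which is the format that supports partial loading. The file name demo_big.mat, the variable name A and the block size are hypothetical and chosen only for illustration; random numbers stand in for a real sample matrix.

```matlab
% Sketch: process a matrix stored in a MAT-file block by block,
% without loading the whole variable into memory.
% demo_big.mat, A and blockSize are hypothetical names and values.
fname = fullfile(tempdir, 'demo_big.mat');
A = rand(1000, 784);                  % stand-in for a large sample matrix
save(fname, 'A', '-v7.3');            % v7.3 format supports partial loading
clear A

m = matfile(fname);                   % handle to the file, no data read yet
[nRows, nCols] = size(m, 'A');        % query dimensions without loading A

blockSize = 250;
colSum = zeros(1, nCols);
for first = 1:blockSize:nRows
    last = min(first + blockSize - 1, nRows);
    block = m.A(first:last, :);       % reads only these rows from disk
    colSum = colSum + sum(block, 1);  % accumulate column sums per block
end
colMean = colSum / nRows;             % per-pixel mean over all samples

delete(fname);                        % remove the temporary file
```

Only one block of rows is held in memory at a time, so the same loop applies to a matrix that is larger than the available RAM.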
2 MNIST Dataset of Digit Patterns

The MNIST (Modified National Institute of Standards and Technology) database is a large database of handwritten digit samples. The database is a standard proposed by the US National Institute of Standards and Technology for calibrating and comparing image recognition methods based on machine learning, primarily neural networks.

Neural networks tend to learn better from specific examples. In developing a neural[Nov15] network for pattern recognition, we use the popular MNIST handwritten digit data set (Hadoop, 2016). Kaggle's open portal for distributing big data [Peh19] uses this very set in the Digit Recognizer training contest. The set contains the following components:

a) trainSet.csv – training data;
b) testSet.csv – test data for presentation.

First, the training data must be loaded into MATLAB (MATLAB, 2018). For this we use the built-in function csvread, whose general form is M = csvread(filename, R1, C1). The row offset R1 = 1 skips the header row:

    trainSet = csvread('trainSet.csv', 1, 0);
    subSet = csvread('testSet.csv', 1, 0);

The first column of trainSet is a label that gives the correct digit for each sample, and each row is one sample. The remaining columns of a row hold the 28x28 image of the handwritten digit, with all pixels placed in a single row rather than in the original rectangular shape. To render the digits, we need to rebuild the rows into 28x28 matrices. This can be done with the reshape function, except that the result must be transposed, because reshape fills the matrix column by column rather than row by row.

To display an image, you need to create a graphics context (window) with the figure function. Properties of the graphics context can be passed to the function as parameters. Initially, the handwritten digits are displayed in the default color palette of green and blue shades (Fig. 1).

Figure 1: Display of the default set

For better perception, we use the colormap() function, which sets the color map of the final image. colormap(gray) sets a linear palette in shades of gray.

The following code fragment reads the first 25 rows of the data set, which are 25 digits, converts the rows into matrices and displays the resulting images on the screen. The number above each image shows the class of the object.

    figure
    colormap(gray)
    for i = 1:25
        subplot(5, 5, i)
        matrix = reshape(trainSet(i, 2:end), [28, 28])';
        imagesc(matrix)
        title(num2str(trainSet(i, 1)))
    end

After this code fragment is executed, an image of 25 handwritten digits appears on the screen, arranged in a single figure window in a black-and-white palette (Fig. 2).

Figure 2: Image of handwritten digits

To build the network[Kho15], we will use the pattern recognition tool 'nprtool' from the Neural Network Toolbox module.

3 Data preparation

The tool expects two sets of data as input:

a) input – a numeric matrix in which each column represents a sample and each row a feature. These are the scanned images of handwritten digits;
b) vectorLabels – a binary matrix of 0s and 1s whose rows correspond to the specific labels that each image represents. It is also called a dummy variable. Neural Network Toolbox also expects labels to be stored in columns, not in rows.

Class labels range from 0 to 9, one for each single-digit number. Because MATLAB indexing starts from 1, the label 0 cannot be used as an index, so we use "10" instead of "0".

The trainSet dataset stores sample images in rows, not columns, so the entire data set needs to be transposed. The MNIST set does not come with a test set prepared in advance. To form it, we hold out 1/3 of the data set, which splits it into training and test parts.

    n = size(trainSet, 1);
    labels = trainSet(:, 1);
    labels(labels == 0) = 10;
    labelsd = dummyvar(labels);
    inputs = trainSet(:, 2:end);
    inputs = inputs';                  % transposition of the set of predictors
    labels = labels';
    labelsd = labelsd';
    rng(1);
    c = cvpartition(n, 'Holdout', floor(n / 3));
    XtrainSet = inputs(:, training(c));
    YtrainSet = labelsd(:, training(c));
    XtestSet = inputs(:, test(c));
    YtestSet = labels(test(c));
    YtestSetd = labelsd(:, test(c));

Explanation: The function c = cvpartition(n, 'Holdout', p) randomly creates a partition for holdout validation on n observations. This partition divides the observations into a training set and a test set. Parameter p must be a scalar. When 0

D. For the input, choose XtrainSet, and for the classes, YtrainSet.

E. After setting the arrays, go to the "Data for verification" section. In this case, you can leave the default values. This divides the data in the ratio 70-15-15 into sets for training, validation and testing.

F. In the network architecture section, change the number of hidden layer neurons to 100 and continue.

G. For training, go to the Train Network section and click "Train" to start training. Upon completion of the operation, a window with the learning results opens. The numerical indicators can be viewed as graphs and charts.

H. On the last tab, you will be prompted to save the learning script for the model you just created (e.g. NeuralNetworkSctipt.m).

Below is a diagram of the artificial neural network model (Figure 3) created with the pattern recognition tool. It has 784 input neurons, 100 hidden layer neurons and 10 output layer neurons (equal to the number of classes to predict).

Figure 3: Neural network model with parameters

The model is trained by adjusting the weights for correct results. In the diagram, W denotes the weights and b the bias units, which are part of the individual neurons. Each individual neuron in the hidden layer is as follows: 784 inputs with the same number of weights, 1 bias unit and 10 activation outputs.

5 Data Visualization

To begin, look inside the structure of the NeuralNetworkSctipt.m file. There you can see the generated variables, such as IW1_1 and x1_step1_keep, which are weight vectors obtained in the process of training the