<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fractal Distribution of Medical Data in Neural Network</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
<institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Lviv 79013</addr-line>
          , Ukraine;
          <institution>Julius-Maximilians-University Würzburg</institution>
          ,
          <addr-line>Am Hubland, D-97074 Würzburg</addr-line>
          , Germany
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>Nowadays the topic of deep learning is becoming more and more popular, and almost every organization wants to have at least one specialist in this area, because artificial intelligence can help medicine grow and increase its productivity. This paper researches one type of neural network, the fractal neural network, by training and testing it and comparing it with other neural networks: we take one dataset, test it on our neural networks, and then compare the results, with graphs and comparisons of their output. In the current paper we implemented a custom neural network and a fractal neural network, then trained and tested both on the CIFAR-10 dataset. The custom neural network showed worse results, but each of its iterations took up to 10 seconds, while one iteration of the fractal neural network took up to 3 minutes. Moreover, our custom network is quite simple, so we can say that it suits datasets with a lower number of classes better. The fractal neural network showed good results, and we are sure that with more powerful computing resources and more time it could perform much better.</p>
      </abstract>
      <kwd-group>
        <kwd>neural networks</kwd>
        <kwd>model</kwd>
        <kwd>medical data</kwd>
        <kwd>keras</kwd>
        <kwd>train</kwd>
        <kwd>dataset</kwd>
        <kwd>accuracy</kwd>
        <kwd>loss</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
In the current paper we research one branch of deep learning: fractal neural
networks. A neural network is a network or circuit of neurons or, in the modern
sense, an artificial neural network composed of artificial neurons or nodes.
Thus, a neural network is either a biological neural network, made up of real
biological neurons, or an artificial neural network used for solving artificial intelligence (AI)
problems. The connections of the biological neuron are modeled as weights. A
positive weight reflects an excitatory connection, while negative values mean inhibitory
connections. All inputs are modified by a weight and summed; this operation is
referred to as a linear combination. Finally, an activation function controls the amplitude
of the output: an acceptable range of output is usually between 0 and 1,
or between −1 and 1. There are many types of neural networks, and the residual neural
network is one of them [
        <xref ref-type="bibr" rid="ref1 ref9">1, 9</xref>
        ].
      </p>
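<p>The weighted sum and activation described above can be sketched in a few lines of Python (a minimal illustration with a sigmoid activation; the function name and example values are our own, not taken from any particular network):</p>

```python
import math

def neuron_output(inputs, weights, bias):
    # Linear combination: every input is modified by a weight and summed
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # The sigmoid activation function keeps the output amplitude in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))
```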
<p>A residual neural network (ResNet) is an artificial neural network (ANN) of a
kind that builds on constructs known from pyramidal cells in the cerebral cortex.
Residual neural networks do this by utilizing skip connections, or shortcuts, to jump
over some layers. Typical ResNet models are implemented with double- or
triple-layer skips that contain nonlinearities (ReLU) and batch normalization in between. An
additional weight matrix may be used to learn the skip weights; these models are
known as HighwayNets. Models with several parallel skips are referred to as
DenseNets. In the context of residual neural networks, a non-residual network may be
described as a plain network.</p>
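<p>The difference between a plain block and a residual block can be illustrated with a small sketch, where scalar signals and a hypothetical transform function stand in for real layers:</p>

```python
def plain_block(x, transform):
    # A "plain" layer: the output is only the transformed signal
    return transform(x)

def residual_block(x, transform):
    # A residual block with a skip connection: the input is added back,
    # so the block only has to learn the residual transform(x)
    return transform(x) + x
```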
      <p>
        One motivation for skipping over layers is to avoid the problem of vanishing
gradients, by reusing activations from a previous layer until the adjacent layer learns its
weights. During training, the weights adapt to mute the upstream layer, and amplify
the previously-skipped layer. In the simplest case, only the weights for the adjacent
layer's connection are adapted, with no explicit weights for the upstream layer. This
works best when a single non-linear layer is stepped over, or when the intermediate
layers are all linear. If not, then an explicit weight matrix should be learned for the
skipped connection (a HighwayNet should be used) [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref6">1-4, 6</xref>
        ].
      </p>
      <p>
Fractal neural networks use a non-residual approach. The macro-architecture of
fractal neural networks is based on self-similarity. Repeated application of a simple
expansion rule generates deep networks whose structural layouts are precisely
truncated fractals. These networks contain interacting subpaths of different lengths, but do
not include any pass-through or residual connections; every internal signal is
transformed by a filter and nonlinearity before being seen by subsequent layers. The key
may be the ability to transition, during training, from effectively shallow to deep.
Additionally, fractal networks exhibit an anytime property: shallow subnetworks
provide a quick answer, while deeper subnetworks, with higher latency, provide a more
accurate answer [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
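<p>The self-similar expansion rule can be sketched as a toy scalar version, under our own simplifying assumptions: a single function stands in for a convolutional layer, and the join operation is a plain average (as in FractalNet's elementwise mean):</p>

```python
def fractal(depth, layer):
    # Base case: the fractal of depth 1 is a single layer (one "conv")
    if depth == 1:
        return layer
    sub = fractal(depth - 1, layer)
    # Expansion rule: join (average) a short path through one layer with
    # a long path through two stacked copies of the previous fractal
    return lambda x: (layer(x) + sub(sub(x))) / 2.0
```

<p>Each application of the rule roughly doubles the depth of the longest path while keeping a one-layer shortcut, which is what gives shallow subnetworks a quick answer and deeper subnetworks a slower but more accurate one.</p>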
    </sec>
    <sec id="sec-2">
      <title>Review of the Literature</title>
      <p>
Fractal neural networks are relatively new, which is why there are only a few articles on
this topic. Frankly speaking, there is only one brief yet comprehensive paper about fractal
neural networks. It was published at ICLR 2017 as a conference paper by Gustav
Larsson, Michael Maire and Gregory Shakhnarovich [
<xref ref-type="bibr" rid="ref14">14</xref>
]. Their paper is called
“FractalNet: Ultra-Deep Neural Networks without Residuals”. They briefly describe
fractal neural networks and how they work. They also compare the results of this
network with more than 20 other networks on about 10 different datasets. They
published code for the FractalNet implementation, which we are going to update and use in
the current paper. Their paper is thus very useful and full of important information. They
had very powerful computing resources, which helped them train and test the networks on
different data for a long time.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Materials and Methods</title>
<p>In order to implement and run our networks we will use Python 3 and Google
Colaboratory as our working environment.</p>
      <p>
Colaboratory is a free Jupyter notebook environment that requires no setup and
runs entirely in the cloud. With Colaboratory you can write and execute code, save
and share your analyses, and access powerful computing resources, all for free from
your browser. It also provides a good GPU for running our networks [
        <xref ref-type="bibr" rid="ref12 ref4">4, 12</xref>
        ].
      </p>
<p>For training and testing we pick the CIFAR-10 dataset from Keras.</p>
      <p>
        Keras is a high-level neural networks API, written in Python and capable of
running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on
enabling fast experimentation. Being able to go from idea to result with the least possible
delay is key to doing good research [
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ].
      </p>
      <p>
        The CIFAR-10 dataset is a collection of images that are commonly used to train
machine learning and computer vision algorithms. It is one of the most widely used
datasets for machine learning research. The CIFAR-10 dataset contains 60,000 32x32
color images in 10 different classes. The 10 different classes represent airplanes, cars,
birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images of each
class [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
<p>Computer algorithms for recognizing objects in photos often learn by example.
CIFAR-10 is a set of images that can be used to teach a computer how to recognize
objects. Since the images in CIFAR-10 are low-resolution (32x32), this dataset
allows researchers to quickly try different algorithms and see what works. Various kinds
of convolutional neural networks tend to be the best at recognizing the images in
CIFAR-10.</p>
      <p>In order to implement our Sequential model we will use the following layers and
functions:</p>
      <p>
        1) ReLU stands for rectified linear unit, and is a type of activation function.
Mathematically, it is defined as y = max(0, x). ReLU is linear (identity) for all positive
values, and zero for all negative values. This means that [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]:
      </p>
      <p>It’s cheap to compute as there is no complicated math. The model can therefore
take less time to train or run.</p>
      <p>It converges faster. Linearity means that the slope doesn’t plateau, or “saturate,”
when x gets large. It doesn’t have the vanishing gradient problem suffered by other
activation functions like sigmoid or tanh.</p>
<p>It’s sparsely activated. Since ReLU is zero for all negative inputs, it is likely that
any given unit will not activate at all.</p>
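<p>ReLU is a direct one-line transcription of y = max(0, x):</p>

```python
def relu(x):
    # Identity for all positive values, zero for all negative values
    return max(0.0, x)
```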
      <p>
2) Softmax is a function that takes as input a vector of K real numbers and
normalizes it into a probability distribution consisting of K probabilities. That is, prior to
applying softmax, some vector components could be negative or greater than one,
and they might not sum to 1; after applying softmax, each component will be in the
interval (0, 1) and the components will add up to 1, so they can be interpreted as
probabilities. Softmax is often used in neural networks to map the non-normalized output
of a network to a probability distribution over predicted output classes [
        <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
        ].
      </p>
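<p>This definition transcribes directly into Python; the max-subtraction trick for numerical stability is a standard implementation detail we add ourselves:</p>

```python
import math

def softmax(vector):
    # Subtracting the maximum does not change the result,
    # but it avoids overflow in math.exp for large inputs
    m = max(vector)
    exps = [math.exp(v - m) for v in vector]
    total = sum(exps)
    # Each component ends up in (0, 1) and the components sum to 1
    return [e / total for e in exps]
```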
<p>3) Dropout is a regularization technique for neural network models. Dropout is a
technique where randomly selected neurons are ignored during training: they are
“dropped out” randomly. This means that their contribution to the activation of
downstream neurons is temporarily removed on the forward pass, and any weight updates are
not applied to those neurons on the backward pass.</p>
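<p>A minimal sketch of dropout on a list of activations; the rescaling by the keep probability (so the expected sum of activations stays unchanged, known as inverted dropout) is a standard detail we assume here:</p>

```python
import random

def dropout(activations, rate, training=True):
    # At inference time all neurons participate unchanged
    if not training:
        return list(activations)
    keep = 1.0 - rate
    # Drop each unit with probability `rate`; scale survivors by 1/keep
    # so the expected activation stays the same as without dropout
    return [a / keep if random.random() > rate else 0.0
            for a in activations]
```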
      <p>
As a neural network learns, neuron weights settle into their context within the
network. Weights of neurons are tuned for specific features, providing some
specialization. Neighboring neurons come to rely on this specialization, which, if taken too
far, can result in a fragile model too specialized to the training data. This reliance on
context for a neuron during training is referred to as complex co-adaptation [
<xref ref-type="bibr" rid="ref9">9</xref>
].
4) Max pooling is a sample-based discretization process. The objective is to
down-sample an input representation (an image, a hidden-layer output matrix, etc.), reducing its
dimensionality and allowing assumptions to be made about the features contained in
the binned sub-regions.
      </p>
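<p>A 2x2 max pooling with stride 2 can be sketched on a plain nested list (our own minimal version, assuming even input dimensions):</p>

```python
def max_pool_2x2(matrix):
    # Slide a 2x2 window with stride 2 and keep the maximum of each
    # window, halving both dimensions of the input representation
    rows, cols = len(matrix), len(matrix[0])
    return [[max(matrix[i][j], matrix[i][j + 1],
                 matrix[i + 1][j], matrix[i + 1][j + 1])
             for j in range(0, cols, 2)]
            for i in range(0, rows, 2)]
```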
      <p>
This is done in part to help prevent over-fitting by providing an abstracted form of the
representation. It also reduces the computational cost by reducing the number of
parameters to learn, and provides basic translation invariance to the internal
representation [
        <xref ref-type="bibr" rid="ref10 ref15">10, 15</xref>
        ].
      </p>
      <p>
Also, we will use the optimization algorithms described below:
1) The RMSprop optimizer is similar to the gradient descent algorithm with
momentum. The RMSprop optimizer restricts the oscillations in the vertical direction.
Therefore, we can increase our learning rate, and our algorithm can take larger steps
in the horizontal direction, converging faster [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ].
      </p>
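<p>The update behind this description can be sketched for a single parameter (our own toy version; the RMSprop optimizer in Keras handles whole tensors, but the rule is the same):</p>

```python
def rmsprop_step(param, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    # Decaying average of squared gradients damps oscillating directions
    cache = decay * cache + (1.0 - decay) * grad ** 2
    # Dividing by the root of that average normalizes the step size
    param = param - lr * grad / (cache ** 0.5 + eps)
    return param, cache
```

<p>For example, repeatedly applying this step to minimize f(x) = x² (gradient 2x) steadily drives x toward 0.</p>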
      <p>
        2) Adaptive Moment Estimation (Adam) is a method that computes adaptive
learning rates for each parameter. It stores both the decaying average of the past
gradients mt, similar to momentum and also the decaying average of the past squared
gradients vt, similar to RMSprop and Adadelta. Thus, it combines the advantages of
both methods. Adam is the default choice of optimizer for most applications in
general [
        <xref ref-type="bibr" rid="ref11 ref13">11, 13</xref>
        ].
      </p>
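<p>A single-parameter sketch of the Adam update, combining both decaying averages with the usual bias correction (the hyperparameter defaults follow the common convention, not a setting from this paper):</p>

```python
def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    # First moment: decaying average of past gradients (momentum-like)
    m = b1 * m + (1.0 - b1) * grad
    # Second moment: decaying average of past squared gradients
    # (RMSprop-like)
    v = b2 * v + (1.0 - b2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v
```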
    </sec>
    <sec id="sec-4">
      <title>Experiment</title>
<p>So, for training our networks we chose the CIFAR-10 dataset. We will train our network
to classify 10 different objects: doctor, patient, disease, mode, ward, hospital, surgery,
tablet, syringe, prescription. The classes are completely mutually exclusive; there is
no overlap between classes, which means that you will not find an image with 2
different classes at the same time.</p>
<p>This means that we could apply our network to solve different medical problems.
For example, it can be helpful for predicting a diagnosis based on a cardiogram.</p>
<p>We will make a custom sequential model to compare with the fractal one. A sequential
model is simply a linear stack of layers, so you can create an empty model and
then add as many layers as you want. In this model we add a few activation layers,
connection layers, regularization layers, convolutional layers and pooling layers. Here is
our final version.</p>
<p>In the table above you can see the full training process with its accuracy and loss at
each step of the training. The best results are highlighted.</p>
<p>Also, on the following graphs (Fig. 3, 4) you can see the dependency of accuracy
and loss on epochs. Accuracy is calculated as the number of right
predictions divided by all predictions.</p>
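<p>The accuracy metric described above can be computed as follows (a trivial sketch of our own, with hypothetical prediction and label lists):</p>

```python
def accuracy(predictions, labels):
    # Number of right predictions divided by the number of all predictions
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)
```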
<p>So, from the graph we can see a logarithmic increase of accuracy. We can also
notice the optimal amount of training, after which the accuracy increases only slightly.</p>
      <p>
        Model for fractal neural network is much more complicated than our custom
model. It has much more layers and much more configurations. The full implementation
of the fractal neural network model can be found via the link in the references [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. It
was published with a paper at ICLR 2017 by Gustav Larsson, Michael Maire and
Gregory Shakhnarovich, as mentioned in the literature review section [
<xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>Now let us train this network the same way as we did with our custom network.
This time we will make 70 epochs, because training fractal network takes more time
and computing resources. Below you can see a piece of our training process (Fig. 5).</p>
<p>In Table 2 you can see the full training process with its accuracy and loss at each
step of the training. The best results are marked in green.</p>
<p>Also, on the following graphs (Fig. 6, 7) you can see the dependency of accuracy and
loss on epochs. As with our custom network, accuracy is calculated as the
number of right predictions divided by all predictions. So, from the graph we can see
the logarithmic increase of accuracy, and we can notice the optimal amount of training,
after which the accuracy increases only slightly. The graph looks similar to our custom
neural network's, but as we can see, the accuracy here is better.</p>
      <p>Now it is time to test our trained models on the test dataset: the set of images which
have not been used during the training process. The process is similar, but we iterate
through our dataset only once and output the results immediately. The results for our
custom network are the following (Fig. 8).</p>
<p>This test showed an accuracy of 0.7929, which means that out of 10,000 labeled
images with 10 different object classes our network predicted 7,929 images right and 2,071
images wrong. Our test accuracy is lower than the training one (0.8111), which
means that we slightly overfit our model on the training dataset: the weights
fit the training dataset a bit better. Lowering the training time may improve our test
accuracy a little.</p>
<p>Now let us head back to our fractal network. Our best accuracy was achieved at
the very end of the epochs, which means that further training may lead to better
results, but it would take more time and more computing resources. Our accuracy is pretty
good, but first let us test it on the test dataset and check that we did not overfit our network
(Fig. 9).</p>
<p>This test showed an accuracy of 0.8864, which means that out of 10,000 labeled images
with 10 different object classes our network predicted 8,864 images right and 1,136
images wrong. Our test accuracy is the same as the training one (0.8864), which means that we
did not overfit our model on the training dataset.</p>
<p>In the table below you can see the final comparison of our models. All training
and testing were done inside Google Colaboratory with its own GPU.</p>
      <p>In the current paper we ran a custom neural network and a fractal neural network inside
Google Colaboratory using the provided GPU. We trained and tested them on the
CIFAR-10 dataset. The custom neural network showed worse results than the fractal one, but each
of its iterations took up to 10 seconds, while one iteration of the fractal neural network took up to 3
minutes. Moreover, our custom network is quite simple, so we can say that it suits
datasets with a lower number of classes better. The fractal neural network showed
good results, and we are sure that with more powerful computing resources and more
time it could perform much better.</p>
<p>As we mentioned before, we can apply this technology to different medical data
to solve various kinds of medical problems. This can help decrease the number of
human mistakes.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Estivill-Castro</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Amoeba: Hierarchical clustering based on spatial proximity using Delaunay diagram</article-title>
          ,
          <source>9th Intern. Symp. on spatial data handling</source>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>41</lpage>
          Beijing, China (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kang</surname>
          </string-name>
          , H.-Y.,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>B.-J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          -J.:
          <source>P2P Spatial query processing by Delaunay triangulation, Lecture notes in computer science</source>
          , vol.
          <volume>3428</volume>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>150</lpage>
          , Springer/Heidelberg (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Boehm</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kailing</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kroeger</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Density connected clus-tering with local subspace preferences</article-title>
          ,
          <source>IEEE Computer Society, Proc. of the 4th IEEE Intern. conf. on data mining</source>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>34</lpage>
          , Los Alamitos (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Boyko</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shakhovska</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basystiuk</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Performance evaluation and comparison of software for face recognition, based on dlib and opencv library</article-title>
          ,
          <source>Second International Conference on Data Stream Mining and Processing</source>
          , pp.
          <fpage>478</fpage>
          -
          <lpage>482</lpage>
          , DSMP (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Boehm</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kailing</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kroeger</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Density connected clus-tering with local subspace preferences” IEEE Computer Society</article-title>
          ,
          <source>Proc. of the 4th IEEE Intern. conf. on data mining</source>
          , pp.
          <fpage>27</fpage>
          -
          <lpage>34</lpage>
          , Los Alamitos (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Harel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koren</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Clustering spatial data using random walks</article-title>
          ,
          <source>Proc. of the 7th ACM SIGKDD Intern. conf. on knowledge discovery and data mining</source>
          , pp.
          <fpage>281</fpage>
          -
          <lpage>286</lpage>
          , San Francisco, California (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Tung</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hou</surname>
            ,
            <given-names>J</given-names>
            ., Han, J
          </string-name>
          . :
          <article-title>Spatial clustering in the presence of obstacles</article-title>
          ,
          <source>The 17th Intern. conf. on data engineering (ICDE'01)</source>
          , pp.
          <fpage>359</fpage>
          -
          <lpage>367</lpage>
          , Heidelberg (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Veres</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shakhovska</surname>
          </string-name>
          , N.:
          <article-title>Elements of the formal model big date</article-title>
          ,
          <source>The 11th Intern. conf. Perspective Technologies and Methods in MEMS Design (MEMSTEH)</source>
          , pp.
          <fpage>81</fpage>
          -
          <lpage>83</lpage>
          ,
          <string-name>
            <surname>Polyana</surname>
          </string-name>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gehrke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gunopulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Automatic sub-space clustering of high dimensional data</article-title>
          , vol.
          <volume>11</volume>
          (
          <issue>1</issue>
          ), pp.
          <fpage>5</fpage>
          -
          <lpage>33</lpage>
          ,
          <article-title>Data mining knowledge discovery (</article-title>
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Ankerst</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ester</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriegel</surname>
          </string-name>
          , H.-P.:
          <article-title>Towards an effective cooperation of the user and the computer for classification</article-title>
          ,
          <source>Proc. of the 6th ACM SIGKDD Intern. conf. on knowledge discovery and data mining</source>
          , pp.
          <fpage>179</fpage>
          -
          <lpage>188</lpage>
          , Boston, Massachusetts, USA (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peuquet</surname>
            ,
            <given-names>D.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gahegan</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>ICEAGE: Interactive clustering and exploration of large and high-dimensional geodata</article-title>
          ,
          <source>vol. 3, N. 7</source>
          , pp.
          <fpage>229</fpage>
          -
          <lpage>253</lpage>
          , Geoinfor-matica (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Boyko</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shakhovska</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sviridova</surname>
          </string-name>
          , N.:
          <article-title>Use of machine learning in the forecast of clinical consequences of cancer diseases</article-title>
          ,
          <source>In 7th Mediterranean Conference on Embedded Computing</source>
          , pp.
          <fpage>531</fpage>
          -
          <lpage>536</lpage>
          , IEEE MECO'
          <year>2018</year>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Boyko</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Advanced technologies of big data research in distributed information systems</article-title>
          , Radio Electronics, Computer Science,
          <source>Control. № 4</source>
          , pp.
          <fpage>66</fpage>
          -
          <lpage>77</lpage>
          , Zaporizhzhya: Zaporizhzhya National Technical University (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Larsson</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shakhnarovich</surname>
          </string-name>
          , G.:
          <article-title>FractalNet: Ultra-Deep Neural Networks without Residuals</article-title>
          , http://people.cs.uchicago.edu/~larsson/fractalnet/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Mochurad</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solomiia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Optimizing the Computational Modeling of Modern Electronic Optical Systems</article-title>
          . In: Lytvynenko V.,
          <string-name>
            <surname>Babichev</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wójcik</surname>
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vynokurova</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vyshemyrskaya</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Radetskaya</surname>
            <given-names>S</given-names>
          </string-name>
          .
          <source>(eds) Lecture Notes in Computational Intelligence and Decision Making</source>
          , pp
          <fpage>597</fpage>
          -
          <lpage>608</lpage>
          ,
          <string-name>
            <surname>ISDMCI</surname>
          </string-name>
          <year>2019</year>
          .
          <source>Advances in Intelligent Systems and Computing</source>
          , vol
          <volume>1020</volume>
          . Springer, Cham. (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>