<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Image clustering by autoencoders</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A S Kovalenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Y M Demyanenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Institute of Mathematics, Mechanics and Computer Science named after I.I. Vorovich</institution>
          ,
          <addr-line>Milchakova street, 8a, Rostov-on-Don, Russia, 344090</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>243</fpage>
      <lpage>249</lpage>
      <abstract>
<p>This paper describes an approach to solving the problem of finding similar images by visual similarity using neural networks on previously unlabeled data. We propose to build a special neural network architecture, an autoencoder, through which high-level features are extracted from images. The search for the nearest elements is realized by the Euclidean metric in the generated feature space, after a preliminary decomposition into two-dimensional space. The proposed approach to generating a feature space can also be applied to the classification task using pre-clustering.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Nowadays there are a large number of approaches to solving the classification problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. But as a rule, they all reduce to supervised learning over some class of models. All of them share one major drawback: the requirement of labeled data for training. When a new task arises in the field of computer vision, labeled data is as a rule missing, and money has to be spent on labeling it.
      </p>
      <p>If we consider approaches based on unsupervised learning methods, such as clustering algorithms, they are usually focused on working with low-dimensional data, whereas images, considered as processed data, usually have a high dimension. Lowering the dimension of the space, for example by the method of principal components, still does not yield a space to which clustering algorithms can be effectively applied.</p>
      <p>There is a need to build a map acting from the image space (1) into a certain feature space of these images, to which decomposition methods can be effectively applied, and in which clustering and search for nearby objects by visual component can be performed directly.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        The most frequently used approach to solving the problem of reducing the dimension is the method of principal components, but it can be applied effectively only to data with a linear structure. If we consider high-dimensional objects, the probability of separating them well becomes small. But if they constitute a mixture of objects drawn from normal distributions with different parameters, they can be separated using the t-SNE algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In most cases, however, the data does not meet these requirements, and there is a need to build a mapping from the current space of objects into a space of descriptive features on which the requirement of normal distribution can be imposed. Such a problem is considered in the tutorial on variational autoencoders [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>In our work, for this purpose an encoder was built and trained, which provides the necessary mapping into a space of image features distributed according to the normal law. The t-SNE algorithm can then be applied to the resulting space to further decompose it into a dimension where clustering algorithms work well.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Building of encoder</title>
      <p>For training and testing of the models, the MNIST image set was used, which contains a collection of images of handwritten digits. An example of the data is shown in figure 1.</p>
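      <p>For illustration, loading and preprocessing the data can be sketched as follows (a minimal sketch assuming the keras.datasets module; flattening to 784-dimensional vectors is an assumption made for the fully connected model):</p>
      <preformat>
# Minimal sketch: load MNIST and scale pixel values to [0, 1].
# The reshape to 784-dimensional vectors is assumed for the fully
# connected model; the convolutional model would keep 28x28 images.
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
      </preformat>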
      <p>
        The autoencoder was built with the architecture shown in figure 2. For the general concept of the architecture in more detail, see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>During the experiments, the following architectures of the encoder (Q) and decoder (P) networks were used: one built only on fully connected layers and one built on convolutional layers. The parameter of these networks was the dimension of the hidden space.</p>
      <p>A detailed description of the architectures of these models in the form of flowcharts is given in the repository attached to the work. A brief description of the models is shown in pictures 3 and 4.</p>
      <p>
        Both models were trained on the above data set of 60000 images. The Adam [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] optimization algorithm was used for training, with 500 learning epochs. The Keras framework with the TensorFlow backend was used for building and training the models.
      </p>
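      <p>The exact architectures are given in the attached repository; purely as an illustration of the general scheme, a variational autoencoder on fully connected layers can be assembled roughly as follows (a sketch assuming TensorFlow 2 / tf.keras; the layer sizes and batch size are illustrative assumptions, not the authors' configuration):</p>
      <preformat>
import tensorflow as tf
from tensorflow.keras import layers, Model

LATENT_DIM = 10  # the hidden-space dimension parameter

# Encoder Q: predicts the mean and the log-diagonal covariance of q(z|x).
enc_in = layers.Input(shape=(784,))
h = layers.Dense(512, activation="relu")(enc_in)
z_mean = layers.Dense(LATENT_DIM)(h)
z_log_var = layers.Dense(LATENT_DIM)(h)

# Reparameterization trick: z = mean + sigma * eps, eps ~ N(0, I).
def sample(args):
    mean, log_var = args
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample)([z_mean, z_log_var])

# Decoder P: reconstructs the image from the latent code.
dec_in = layers.Input(shape=(LATENT_DIM,))
g_dec = layers.Dense(512, activation="relu")(dec_in)
dec_out = layers.Dense(784, activation="sigmoid")(g_dec)
decoder = Model(dec_in, dec_out)

vae = Model(enc_in, decoder(z))

# Loss = reconstruction error + KL divergence to the standard normal prior.
recon_loss = 784 * tf.reduce_mean(
    tf.keras.losses.binary_crossentropy(enc_in, vae.output))
kl_loss = -0.5 * tf.reduce_mean(tf.reduce_sum(
    1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
vae.add_loss(recon_loss + kl_loss)

vae.compile(optimizer="adam")                 # Adam, as in the paper
vae.fit(x_train, epochs=500, batch_size=128)  # 500 learning epochs

# The encoder used in the rest of the pipeline returns the mean vector.
encoder = Model(enc_in, z_mean)
      </preformat>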
      <p>At the output, the encoder returns two vectors: the vector of mean values of the distribution to which the object belongs and the vector of the covariance matrix in diagonal form. To construct the set H, we use the mean vector, since it characterizes the cluster centroid in the space of hidden features; here X denotes the original set of objects (1) and g the encoder.</p>
      <sec id="sec-3-1">
        <title>After training encoder, we construct the desired set H as follows:</title>
        <p>N
= fImgm=1
H = fg(x)jx 2
g:</p>
      <p>Now we lower the dimension of the set H using the t-distributed stochastic neighbor embedding (t-SNE) algorithm:</p>
      <p>Ĥ = t-SNE(H), ∀h ∈ Ĥ ⇒ dim(h) = 2. (3)</p>
      <p>On the other hand, one can set the dimension of the hidden space of the variational autoencoder equal to 2 and immediately obtain a set of objects of the desired dimension. But in this case the loss of information when an object is encoded increases, and the results of decomposition with this approach are worse.</p>
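      <p>Steps (2) and (3) can be sketched as follows, assuming the encoder model from the previous sketch and the scikit-learn implementation of t-SNE:</p>
      <preformat>
from sklearn.manifold import TSNE

X = x_train[:5000]      # the 5000-element subset used in section 4
H = encoder.predict(X)  # (2): H = {g(x) | x in X}, the mean vectors
H_hat = TSNE(n_components=2).fit_transform(H)  # (3): dim(h) = 2
      </preformat>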
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We took 5000 elements from the MNIST data set and, applying the encoder followed by the t-SNE algorithm, obtained the set Ĥ (3). Consider examples of the sets Ĥ obtained using the encoder model g with various values of the hidden space dimension parameter. Figure 5 depicts the set Ĥ when the dimension of the hidden space equals 2, without further use of t-SNE.</p>
      <p>Figure 6 depicts the set Ĥ constructed using the fully connected model with a hidden space dimension parameter of 10, followed by the t-SNE algorithm.</p>
      <p>Figure 7 depicts the set Ĥ constructed using the convolutional model with a hidden space dimension parameter of 10, followed by the t-SNE algorithm.</p>
      <p>
        It can be observed that when a hidden space of higher dimension is used with subsequent application of the t-SNE algorithm, the classes become better separable, picture 6. This is because the autoencoder restores the image at the output more accurately and, as a result, the space of hidden features becomes more representative. When the convolutional architecture is used, the visual separability of classes is similar, picture 7, while the number of parameters of this architecture is an order of magnitude smaller than that of the one consisting of fully connected layers. Since, when training a variational autoencoder, the encoder learns to predict the parameters of the normal distribution to which an object should belong, the set of predicted mean values forms a space of normally distributed values. The t-SNE algorithm measures the proximity of objects under a normal distribution, which leads to a good decomposition result. There is also a method of reducing the dimension of the space based on principal component analysis (PCA) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], but it shows worse results when applied to this set of hidden features, picture 8.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Search of nearest images and clustering</title>
        <p>As an example of searching for the nearest images by visual similarity, random elements are taken from the set Ĥ (3), and for each of them the 5 nearest elements of the same set are found using the Euclidean metric, pictures 9 and 10.</p>
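        <p>This search can be sketched as follows, assuming the set H_hat from the previous sketch and scikit-learn's NearestNeighbors:</p>
        <preformat>
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Index H_hat under the Euclidean metric; ask for 6 neighbours because
# the query element is returned as its own nearest neighbour.
nn = NearestNeighbors(n_neighbors=6, metric="euclidean").fit(H_hat)

query = np.random.randint(len(H_hat))   # a random element, as in the text
_, idx = nn.kneighbors(H_hat[query:query + 1])
nearest = idx[0][1:]                    # its 5 nearest images
        </preformat>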
      <p>
        Since in the set Ĥ (3) the data clusters have an arbitrary non-linear form, the DBSCAN clustering algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is applied. After splitting the set into clusters, 10 representatives of each cluster are taken and a label for each cluster is selected by majority voting over the available data markup. After that, the classification accuracy is evaluated: it is 82.2%. If the space is reduced to dimension 2 by the encoder alone, the accuracy is 75.9%.
      </p>
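        <p>This evaluation can be sketched as follows, assuming scikit-learn's DBSCAN; the eps value and the use of the first 10 members of each cluster as voters are illustrative assumptions:</p>
        <preformat>
import numpy as np
from sklearn.cluster import DBSCAN

y = y_train[:5000]             # markup, used only for voting and scoring
clusters = DBSCAN(eps=3.0).fit_predict(H_hat)  # eps is an assumed value

pred = np.full(len(y), -1)
for c in set(clusters) - {-1}:             # label -1 marks noise points
    members = np.where(clusters == c)[0]
    voters = members[:10]                  # 10 representatives per cluster
    pred[members] = np.bincount(y[voters]).argmax()  # majority vote

mask = pred != -1
accuracy = (pred[mask] == y[mask]).mean()
        </preformat>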
      <sec id="sec-4-1">
        <title>Original image (top) and 5</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this work, a variational autoencoder model was built and trained on the MNIST data set. The experiment shows that the described approach of using a hidden space of larger dimension with its further decomposition by the t-SNE method gives better separability of classes compared to reducing the dimension by the autoencoder alone, and gives higher accuracy when clustering the set: 82.2% instead of 75.9%.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The considered approach allows us to solve the classification problem on previously unlabeled data and to search for the nearest images based on visual similarity.</p>
      <p>
        One can also select the elements nearest to the cluster centroids, which are more likely to be correctly classified during clustering, and train a classifier based on the neural network of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which may allow classifying all of the input data with better precision.
      </p>
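      <p>A hypothetical sketch of this extension, reusing the names from the previous sketches (the choice of 50 elements per cluster is an assumption):</p>
      <preformat>
import numpy as np

# For each cluster, keep the elements nearest to its centroid in H_hat;
# the resulting pseudo-labelled pairs could train a CNN classifier [7].
pseudo_x, pseudo_y = [], []
for c in set(clusters) - {-1}:
    members = np.where(clusters == c)[0]
    centroid = H_hat[members].mean(axis=0)
    dist = np.linalg.norm(H_hat[members] - centroid, axis=1)
    chosen = members[np.argsort(dist)[:50]]  # 50 most central elements
    pseudo_x.append(X[chosen])
    pseudo_y.append(np.full(len(chosen), pred[chosen[0]]))

pseudo_x = np.concatenate(pseudo_x)
pseudo_y = np.concatenate(pseudo_y)
      </preformat>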
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Kotsiantis</surname>
            <given-names>S</given-names>
          </string-name>
          2007
          <article-title>Supervised Machine Learning: A Review of Classification Techniques</article-title>
          (Peloponnese: Department of Computer Science and Technology, University of Peloponnese)
          <fpage>249</fpage>
          -
          <lpage>268</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Maaten</surname>
            <given-names>L</given-names>
          </string-name>
          2008
          <article-title>Visualizing Data using t-SNE</article-title>
          <source>Journal of Machine Learning Research</source>
          9
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Doersch</surname>
            <given-names>C</given-names>
          </string-name>
          2016
          <article-title>Tutorial on Variational Autoencoders</article-title>
          <source>CoRR</source>
          ArXiv: abs/1606.05908 (accessed 28.05.2019)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Kingma</surname>
            <given-names>D</given-names>
          </string-name>
          2015
          <article-title>Adam: A method for stochastic optimization</article-title>
          (Scottsdale: ICLR) ArXiv: abs/1412.6980 (accessed 28.05.2019)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Shlens</surname>
            <given-names>J</given-names>
          </string-name>
          2014
          <article-title>A Tutorial on Principal Component Analysis</article-title>
          <source>CoRR</source>
          ArXiv: abs/1404.1100 (accessed 28.05.2019)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Ester</surname>
            <given-names>M</given-names>
          </string-name>
          1996
          <article-title>A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise</article-title>
          (AAAI Press)
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Wu</surname>
            <given-names>H</given-names>
          </string-name>
          2017
          <article-title>CNN-Based Recognition of Handwritten Digits in MNIST Database</article-title>
          (Berlin: Springer)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>