Image clustering by autoencoders

A S Kovalenko1, Y M Demyanenko1

1Institute of Mathematics, Mechanics and Computer Science named after I.I. Vorovich, Milchakova street, 8a, Rostov-on-Don, Russia, 344090

e-mail: alexey.s.russ@mail.ru, demyanam@gmail.com

Abstract. This paper describes an approach to finding visually similar images with neural networks trained on previously unlabeled data. We propose a special neural network architecture, an autoencoder, with which high-level features are extracted from images. The search for the nearest elements is performed with the Euclidean metric in the generated feature space, after a preliminary decomposition into two-dimensional space. The proposed approach to generating the feature space can also be applied to the classification task via preliminary clustering.

1. Introduction
Nowadays there are many approaches to solving the classification problem [1]. As a rule, however, they all come down to supervised learning and share one major drawback: the requirement of labeled data for training. When a new computer vision task arises, labeled data is usually missing, and resources must be spent on labeling it. Unsupervised methods such as clustering algorithms, on the other hand, are usually designed for low-dimensional data, whereas images are high-dimensional. Reducing the dimensionality, for example by the method of principal components, still does not yield a space to which clustering algorithms can be applied effectively. There is therefore a need to build a mapping from the image space Ω (1) into a feature space of these images, to which decomposition methods can be applied effectively and in which clustering and search for nearby objects by visual similarity can be performed directly.

2. Related work
The most frequently used approach to dimensionality reduction is the method of principal components, but it is effective only for data with linear structure. For high-dimensional objects, the probability of good separation becomes small. If, however, the objects constitute a mixture drawn from normal distributions with different parameters, they can be separated using the t-SNE algorithm [2]. In most cases, the data does not meet these requirements, so there is a need to build a mapping from the original object space into a space of descriptive features on which the requirement of normal distribution is imposed. Such a problem is considered in the tutorial on variational autoencoders [3].

In our work, an encoder was built and trained for this purpose; it provides the required mapping into a space of image features distributed according to the normal law. The t-SNE algorithm can then be applied to the resulting space to further decompose it into dimensions where clustering algorithms work well.

3. Building the encoder
For training and testing of the models, the MNIST image set, which contains a collection of handwritten digits, was used. An example of the data is shown in figure 1.

Figure 1. MNIST example.
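The data set can be loaded directly through Keras, the framework used for the models below. A minimal sketch follows; the scaling to [0, 1] and the flattening to 784-dimensional vectors are common preprocessing conventions assumed here, not details stated in the paper:

```python
# Minimal sketch: load MNIST with Keras (60000 training images).
# Scaling to [0, 1] and flattening to 784-dimensional vectors are
# assumptions for the fully connected model, not details from the paper.
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
```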
The autoencoder was built with the architecture shown in figure 2; for the general concept of this architecture, see [3].

Figure 2. General block diagram of the autoencoder model, where the blue blocks indicate loss functions.

During the experiments, two architectures of the encoder (Q) and decoder (P) networks were used: one built only on fully connected layers and one built on convolutional layers. The dimension of the hidden space served as a parameter of these networks. A detailed description of the model architectures in the form of flowcharts is in the repository attached to this work; a brief summary of the models is shown in figures 3 and 4.

Figure 3. Summary of the model built on fully connected layers.

Figure 4. Summary of the model built on convolutional layers.

Both models were trained on the 60000-image training set described above. The Adam optimization algorithm [4] was used for training, with 500 learning epochs. The models were built and trained with the Keras framework on a TensorFlow backend.

At the output, the encoder returns two vectors: the vector of mean values of the distribution to which the object belongs and the vector representing the covariance matrix in diagonal form. To construct the set H we use the mean value, since it characterizes the cluster centroid in the space of hidden features. Here Ω is the original set of objects (1) and g is the encoder:

Ω = {I_m}_{m=1}^{N}. (1)

After training the encoder, we construct the desired set H as follows:

H = {g(x) | x ∈ Ω}. (2)

Now we lower the dimension of the set H using the t-distributed stochastic neighbor embedding algorithm (t-SNE):

Ĥ = t-SNE(H), ∀h ∈ Ĥ ⇒ dim(h) = 2. (3)

(A code sketch of these steps is given below.)

Alternatively, the dimension of the hidden space of the variational autoencoder can be set to 2 directly, which yields a set of objects of the desired dimension. In that case, however, more information is lost when an object is encoded, and the decomposition results are worse.

4. Experiments
We took 5000 elements from the MNIST data set and, applying the encoder followed by the t-SNE algorithm, obtained the set Ĥ (3). Consider examples of the sets Ĥ obtained using the encoder g with various values of the hidden space dimension parameter.

Figure 5 depicts the set Ĥ when the dimension of the hidden space equals 2, without further use of t-SNE.

Figure 5. Visualization of the hidden space of dimension 2, obtained by a fully connected encoder.

Figure 6 depicts the set Ĥ constructed using the fully connected model with a hidden space dimension of 10, followed by the t-SNE algorithm.

Figure 6. Hidden space of dimension 10, obtained by a fully connected encoder, followed by t-SNE.

Figure 7 depicts the set Ĥ constructed using the convolutional model with a hidden space dimension of 10, followed by the t-SNE algorithm.

It can be observed that, when a hidden space of higher dimension is used and the t-SNE algorithm is then applied to it, the classes become better separable (figure 6). This is because the autoencoder reconstructs the image at the output more accurately and, as a result, the space of hidden features becomes more representative.
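Before comparing the two encoder architectures further, the pipeline of equations (1)-(3) can be summarized in code. The following is a minimal sketch assuming the fully connected encoder; the layer sizes are illustrative assumptions (the exact architectures are given as flowcharts in the repository), and the decoder and the loss terms of figure 2 (reconstruction and KL divergence) are omitted for brevity:

```python
# Minimal sketch of eqs. (1)-(3): the encoder g maps images to the mean
# vectors of their latent distributions; t-SNE then decomposes H into two
# dimensions. Layer sizes are illustrative assumptions; training (Adam,
# 500 epochs, with the losses of figure 2) is omitted for brevity.
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.manifold import TSNE

latent_dim = 10  # the hidden space dimension parameter

inputs = keras.Input(shape=(784,))                # flattened 28x28 image
h = layers.Dense(512, activation="relu")(inputs)
h = layers.Dense(256, activation="relu")(h)
z_mean = layers.Dense(latent_dim)(h)              # mean of the latent distribution
z_log_var = layers.Dense(latent_dim)(h)           # diagonal covariance (log scale)
encoder = keras.Model(inputs, [z_mean, z_log_var], name="Q")

# H = {g(x) | x in Omega}, eq. (2): keep only the mean vectors
H = encoder.predict(x_train[:5000])[0]

# eq. (3): decompose H into two dimensions with t-SNE
H_hat = TSNE(n_components=2).fit_transform(H)
```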
When the convolutional architecture is used, the visual separability of classes is similar (figure 7), while the number of parameters of this architecture is an order of magnitude smaller than that of the model consisting entirely of fully connected layers.

Figure 7. Hidden space of dimension 10, obtained by a convolutional encoder, followed by t-SNE.

Since the encoder of a variational autoencoder is trained to predict the parameters of the normal distribution to which an object should belong, the set of predicted mean values forms a space of normally distributed values. The t-SNE algorithm uses a proximity metric for objects under the normal distribution, which results in a good decomposition. There is also a dimensionality reduction method based on principal components (PCA) [5], but it shows worse results when applied to this set of hidden features (figure 8).

Figure 8. Visualization of the space obtained by the action of the encoder consisting of fully connected layers, after decomposition by PCA.

4.1. Search for nearest images and clustering
As an example of searching for the nearest images by visual similarity, random elements are taken from the set Ĥ (3), and for each of them the 5 nearest elements of the same set are found using the Euclidean metric (figures 9 and 10). Since the data clusters in the set Ĥ (3) have a random non-linear form, the DBSCAN clustering algorithm [6] is applied. After the set is split into clusters, 10 representatives of each cluster are taken and a label for each cluster is selected by voting, using the available data markup. After that, the classification accuracy is evaluated: it is 82.2%. If the space is reduced to dimension 2 by the encoder alone, the accuracy is 75.9%. (A code sketch of this procedure is given after section 7.)

Figure 9. Original image (top) and 5 closest (bottom).

Figure 10. Original image (top) and 5 closest (bottom).

5. Results
In this work, a variational autoencoder model was built and trained on the MNIST data set. The experiments show that the described approach, which uses a hidden space of larger dimension with its further decomposition by the t-SNE method, gives better class separability than reducing the dimension by the autoencoder alone, and gives higher accuracy when clustering the set: 82.2% instead of 75.9%.

6. Conclusion
The considered approach makes it possible to solve the classification problem on previously unlabeled data and to search for the closest images by visual similarity. One can also select the elements nearest to the cluster centroids, which are more likely to be correctly classified during clustering, and train a neural network classifier [7] on them, which may allow all the input data to be classified with better precision.

7. Applications
Link to GitHub repository with implementation: Clustering-by-VAE
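To complement the repository link above, the following is a minimal sketch of the nearest-image search and the voting-based evaluation of section 4.1, continuing from the previous sketch. The DBSCAN parameters are illustrative assumptions, as the paper does not state them:

```python
# Minimal sketch of section 4.1. DBSCAN parameters (eps, min_samples)
# are illustrative assumptions; the paper does not state them.
import numpy as np
from sklearn.cluster import DBSCAN

# 5 nearest images to a query point by the Euclidean metric (figures 9-10)
query = H_hat[0]
distances = np.linalg.norm(H_hat - query, axis=1)
nearest5 = np.argsort(distances)[1:6]             # skip index 0, the query itself

# cluster the decomposed set and label each cluster by voting
clusters = DBSCAN(eps=3.0, min_samples=10).fit_predict(H_hat)
y_true = y_train[:5000]                           # markup, used only for voting
predicted = np.full(len(clusters), -1)
for c in set(clusters) - {-1}:                    # -1 marks DBSCAN noise
    members = np.where(clusters == c)[0]
    reps = np.random.choice(members, size=min(10, len(members)), replace=False)
    label = np.bincount(y_true[reps]).argmax()    # majority vote of 10 representatives
    predicted[members] = label

mask = predicted != -1
accuracy = float((predicted[mask] == y_true[mask]).mean())
```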
References
[1] Kotsiantis S 2007 Supervised Machine Learning: A Review of Classification Techniques (Peloponnese: Department of Computer Science and Technology, University of Peloponnese) 249-268
[2] Maaten L 2008 Visualizing Data using t-SNE Journal of Machine Learning Research 9 2579-2605
[3] Doersch C 2016 Tutorial on Variational Autoencoders CoRR arXiv: abs/1606.05908 (28.05.2019)
[4] Kingma D 2015 Adam: A Method for Stochastic Optimization (Scottsdale: ICLR) arXiv: abs/1412.6980 (28.05.2019)
[5] Shlens J 2014 A Tutorial on Principal Component Analysis CoRR arXiv: abs/1404.1100 (28.05.2019)
[6] Ester M 1996 A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise (AAAI Press) 226-231
[7] Wu H 2017 CNN-Based Recognition of Handwritten Digits in MNIST Database (Berlin: Springer)