<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Image clustering by autoencoders</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>A S Kovalenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Y M Demyanenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Institute of Mathematics, Mechanics and Computer Science named after I.I. Vorovich</institution>
          ,
          <addr-line>Milchakova street, 8a, Rostov-on-Don, Russia, 344090</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>243</fpage>
      <lpage>249</lpage>
      <abstract>
<p>This paper describes an approach to solving the problem of finding similar images by visual similarity using neural networks on previously unlabeled data. We propose to build a special neural network architecture, an autoencoder, through which high-level features are extracted from images. The search for the nearest elements is realized by the Euclidean metric in the generated feature space, after a preliminary decomposition into two-dimensional space. The proposed approach to generating a feature space can also be applied to the classification task using pre-clustering.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Nowadays there are a large number of approaches to solving the classification problem [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. But as a rule, they all reduce to supervised learning over some class of models. All of them share one major drawback: the requirement of labeled data for training. When a new task arises in the field of computer vision, labeled data is as a rule missing, and money has to be spent on labeling it.
      </p>
      <p>If we consider approaches based on unsupervised learning methods, such as clustering algorithms, they are usually focused on working with low-dimensional data, whereas images, considered as processed data, usually have a high dimension. Lowering the dimension of the space, for example by the method of principal components, still does not yield a space to which clustering algorithms can be effectively applied.</p>
      <p>There is a need to build a map acting from the image space (1) into a certain feature space of these images, to which decomposition methods can be effectively applied, and in which clustering and search for nearby objects by visual component can be performed directly.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        The most frequently used approach to solving the problem of reducing the dimension is the method of principal components, but it can be applied effectively only to data with a linear structure. If we consider high-dimensional objects, the probability of separating them well becomes small. But if they constitute a mixture of objects drawn from normal distributions with different parameters, they can be separated using the t-SNE algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In most cases, however, the data does not meet these requirements, and there is a need to build a mapping from the current space of objects into a space of descriptive features on which the requirement of normal distribution can be imposed. Such a problem is considered in the tutorial on variational autoencoders [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>In our work, for this purpose an encoder was built and trained, which provides the necessary mapping into a space of image features distributed according to the normal law. The t-SNE algorithm can then be applied to the resulting space to further decompose it into a dimension where clustering algorithms work well.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Building of encoder</title>
      <p>For training and testing of the models, the MNIST image set was used, which contains a collection of images of handwritten digits. An example of the data is shown in figure 1.</p>
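      <p>For illustration, loading and preprocessing the data can be sketched as follows (a minimal sketch assuming the keras.datasets module; flattening to 784-dimensional vectors is an assumption made for the fully connected model):</p>
      <preformat>
# Minimal sketch: load MNIST and scale pixel values to [0, 1].
# The reshape to 784-dimensional vectors is assumed for the fully
# connected model; the convolutional model would keep 28x28 images.
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
      </preformat>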
      <p>
        The autoencoder was built with the architecture shown in figure 2. For the general concept of the architecture in more detail, see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>During the experiments, the following architectures of the encoder (Q) and decoder (P) networks were used: one built only on fully connected layers and one built on convolutional layers. The parameter of these networks was the dimension of the hidden space.</p>
      <p>A detailed description of the architectures of these models in the form of flowcharts is given in the repository attached to the work. A brief description of the models is shown in pictures 3 and 4.</p>
      <p>
        Both models were trained on the above data set of 60000 images. The Adam [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] optimization algorithm was used for training, with 500 learning epochs. The Keras framework with the TensorFlow backend was used for building and training the models.
      </p>
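      <p>The exact architectures are given in the attached repository; purely as an illustration of the general scheme, a variational autoencoder on fully connected layers can be assembled roughly as follows (a sketch assuming TensorFlow 2 / tf.keras; the layer sizes and batch size are illustrative assumptions, not the authors' configuration):</p>
      <preformat>
import tensorflow as tf
from tensorflow.keras import layers, Model

LATENT_DIM = 10  # the hidden-space dimension parameter

# Encoder Q: predicts the mean and the log-diagonal covariance of q(z|x).
enc_in = layers.Input(shape=(784,))
h = layers.Dense(512, activation="relu")(enc_in)
z_mean = layers.Dense(LATENT_DIM)(h)
z_log_var = layers.Dense(LATENT_DIM)(h)

# Reparameterization trick: z = mean + sigma * eps, eps ~ N(0, I).
def sample(args):
    mean, log_var = args
    eps = tf.random.normal(tf.shape(mean))
    return mean + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample)([z_mean, z_log_var])

# Decoder P: reconstructs the image from the latent code.
dec_in = layers.Input(shape=(LATENT_DIM,))
g_dec = layers.Dense(512, activation="relu")(dec_in)
dec_out = layers.Dense(784, activation="sigmoid")(g_dec)
decoder = Model(dec_in, dec_out)

vae = Model(enc_in, decoder(z))

# Loss = reconstruction error + KL divergence to the standard normal prior.
recon_loss = 784 * tf.reduce_mean(
    tf.keras.losses.binary_crossentropy(enc_in, vae.output))
kl_loss = -0.5 * tf.reduce_mean(tf.reduce_sum(
    1 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1))
vae.add_loss(recon_loss + kl_loss)

vae.compile(optimizer="adam")                 # Adam, as in the paper
vae.fit(x_train, epochs=500, batch_size=128)  # 500 learning epochs

# The encoder used in the rest of the pipeline returns the mean vector.
encoder = Model(enc_in, z_mean)
      </preformat>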
      <p>At the output, the encoder returns two vectors: the vector of mean values of the distribution to which the object belongs and the vector of the covariance matrix in diagonal form. To construct the set H, we use the mean vector, since it characterizes the cluster centroid in the space of hidden features; here X denotes the original set of objects (1) and g the encoder.</p>
      <sec id="sec-3-1">
        <title>After training encoder, we construct the desired set H as follows:</title>
        <p>N
= fImgm=1
H = fg(x)jx 2
g:</p>
      <p>Now we lower the dimension of the set H using the t-distributed stochastic neighbor embedding (t-SNE) algorithm:</p>
      <p>Ĥ = t-SNE(H), ∀h ∈ Ĥ ⇒ dim(h) = 2. (3)</p>
      <p>On the other hand, one can set the dimension of the hidden space of the variational autoencoder equal to 2 and immediately obtain a set of objects of the desired dimension. But in this case the loss of information when an object is encoded increases, and the results of decomposition with this approach are worse.</p>
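      <p>Steps (2) and (3) can be sketched as follows, assuming the encoder model from the previous sketch and the scikit-learn implementation of t-SNE:</p>
      <preformat>
from sklearn.manifold import TSNE

X = x_train[:5000]      # the 5000-element subset used in section 4
H = encoder.predict(X)  # (2): H = {g(x) | x in X}, the mean vectors
H_hat = TSNE(n_components=2).fit_transform(H)  # (3): dim(h) = 2
      </preformat>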
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We took 5000 elements from the MNIST data set and, applying the encoder followed by the t-SNE algorithm, obtained the set Ĥ (3). Consider examples of the sets Ĥ obtained using the encoder model g with various values of the hidden space dimension parameter. Figure 5 depicts the set Ĥ when the dimension of the hidden space equals 2, without further use of t-SNE.</p>
      <p>Figure 6 depicts the set Ĥ constructed using the fully connected model with a hidden space dimension parameter of 10, followed by the t-SNE algorithm.</p>
      <p>Figure 7 depicts the set Ĥ constructed using the convolutional model with a hidden space dimension parameter of 10, followed by the t-SNE algorithm.</p>
      <p>
        It can be observed that when a hidden space of higher dimension is used with subsequent application of the t-SNE algorithm, the classes become better separable, picture 6. This is because the autoencoder restores the image at the output more accurately and, as a result, the space of hidden features becomes more representative. When the convolutional architecture is used, the visual separability of classes is similar, picture 7, while the number of parameters of this architecture is an order of magnitude smaller than that of the one consisting of fully connected layers. Since, when training a variational autoencoder, the encoder learns to predict the parameters of the normal distribution to which an object should belong, the set of predicted mean values forms a space of normally distributed values. The t-SNE algorithm measures the proximity of objects under a normal distribution, which leads to a good decomposition result. There is also a method of reducing the dimension of the space based on principal component analysis (PCA) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], but it shows worse results when applied to this set of hidden features, picture 8.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Search of nearest images and clustering</title>
        <p>As an example of searching for the nearest images by visual similarity, random elements are taken from the set Ĥ (3), and for each of them the 5 nearest elements of the same set are found using the Euclidean metric, pictures 9 and 10.</p>
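        <p>This search can be sketched as follows, assuming the set H_hat from the previous sketch and scikit-learn's NearestNeighbors:</p>
        <preformat>
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Index H_hat under the Euclidean metric; ask for 6 neighbours because
# the query element is returned as its own nearest neighbour.
nn = NearestNeighbors(n_neighbors=6, metric="euclidean").fit(H_hat)

query = np.random.randint(len(H_hat))   # a random element, as in the text
_, idx = nn.kneighbors(H_hat[query:query + 1])
nearest = idx[0][1:]                    # its 5 nearest images
        </preformat>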
      <p>
        Since in the set Ĥ (3) the data clusters have an arbitrary non-linear form, the DBSCAN clustering algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is applied. After splitting the set into clusters, 10 representatives of each cluster are taken and a label for each cluster is selected by majority voting over the available data markup. After that, the classification accuracy is evaluated: it is 82.2%. If the space is reduced to dimension 2 by the encoder alone, the accuracy is 75.9%.
      </p>
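        <p>This evaluation can be sketched as follows, assuming scikit-learn's DBSCAN; the eps value and the use of the first 10 members of each cluster as voters are illustrative assumptions:</p>
        <preformat>
import numpy as np
from sklearn.cluster import DBSCAN

y = y_train[:5000]             # markup, used only for voting and scoring
clusters = DBSCAN(eps=3.0).fit_predict(H_hat)  # eps is an assumed value

pred = np.full(len(y), -1)
for c in set(clusters) - {-1}:             # label -1 marks noise points
    members = np.where(clusters == c)[0]
    voters = members[:10]                  # 10 representatives per cluster
    pred[members] = np.bincount(y[voters]).argmax()  # majority vote

mask = pred != -1
accuracy = (pred[mask] == y[mask]).mean()
        </preformat>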
      <sec id="sec-4-1">
        <title>Original image (top) and 5</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>In this work, a variational autoencoder model was built and trained on the MNIST data set. The experiment shows that the described approach of using a hidden space of larger dimension with its further decomposition by the t-SNE method gives better separability of classes compared to reducing the dimension by the autoencoder alone, and gives higher accuracy when clustering the set: 82.2% instead of 75.9%.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The considered approach allows us to solve the classification problem on previously unlabeled data and to search for the nearest images based on visual similarity.</p>
      <p>
        One can also select the elements nearest to the cluster centroids, which are more likely to be correctly classified during clustering, and train a classifier based on the neural network of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which may allow classifying all of the input data with better precision.
      </p>
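      <p>A hypothetical sketch of this extension, reusing the names from the previous sketches (the choice of 50 elements per cluster is an assumption):</p>
      <preformat>
import numpy as np

# For each cluster, keep the elements nearest to its centroid in H_hat;
# the resulting pseudo-labelled pairs could train a CNN classifier [7].
pseudo_x, pseudo_y = [], []
for c in set(clusters) - {-1}:
    members = np.where(clusters == c)[0]
    centroid = H_hat[members].mean(axis=0)
    dist = np.linalg.norm(H_hat[members] - centroid, axis=1)
    chosen = members[np.argsort(dist)[:50]]  # 50 most central elements
    pseudo_x.append(X[chosen])
    pseudo_y.append(np.full(len(chosen), pred[chosen[0]]))

pseudo_x = np.concatenate(pseudo_x)
pseudo_y = np.concatenate(pseudo_y)
      </preformat>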
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Kotsiantis</surname>
            <given-names>S</given-names>
          </string-name>
          2007
          <article-title>Supervised Machine Learning: A Review of Classification Techniques</article-title>
          (Peloponnese: Department of Computer Science and Technology, University of Peloponnese)
          <fpage>249</fpage>
          -
          <lpage>268</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Maaten</surname>
            <given-names>L</given-names>
          </string-name>
          2008
          <article-title>Visualizing Data using t-SNE</article-title>
          <source>Journal of Machine Learning Research</source>
          9
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Doersch</surname>
            <given-names>C</given-names>
          </string-name>
          2016
          <article-title>Tutorial on Variational Autoencoders</article-title>
          <source>CoRR</source>
          ArXiv: abs/1606.05908 (accessed 28.05.2019)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Kingma</surname>
            <given-names>D</given-names>
          </string-name>
          2015
          <article-title>Adam: A method for stochastic optimization</article-title>
          (Scottsdale: ICLR) ArXiv: abs/1412.6980 (accessed 28.05.2019)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Shlens</surname>
            <given-names>J</given-names>
          </string-name>
          2014
          <article-title>A Tutorial on Principal Component Analysis</article-title>
          <source>CoRR</source>
          ArXiv: abs/1404.1100 (accessed 28.05.2019)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Ester</surname>
            <given-names>M</given-names>
          </string-name>
          1996
          <article-title>A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise</article-title>
          (AAAI Press)
          <fpage>226</fpage>
          -
          <lpage>231</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Wu</surname>
            <given-names>H</given-names>
          </string-name>
          2017
          <article-title>CNN-Based Recognition of Handwritten Digits in MNIST Database</article-title>
          (Berlin: Springer)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>