<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Unsupervised clustering with growing self-organizing neural network – a comparison with non-neural approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Hynar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michal Burda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jana Šarmanová</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science</institution>
          ,
          <addr-line>VS</addr-line>
        </aff>
      </contrib-group>
      <fpage>58</fpage>
      <lpage>68</lpage>
      <abstract>
        <p>The commonly used approaches to non-hierarchical clustering of data are the well-known k-means or k-medoids methods. However, these fundamental methods are poorly applicable in situations where the number of clusters is almost unpredictable. They were therefore adapted to allow splitting and merging of clusters when some defined criterion is met. On the other hand, there are also methods based on artificial neural networks, concretely on self-organizing maps. One of the interesting ideas in this domain is to allow the net to grow, which corresponds to the adapted k-means method. In this article we compare both approaches with respect to their ability to detect clusters in unknown data.</p>
      </abstract>
      <kwd-group>
        <kwd>data mining</kwd>
        <kwd>cluster analysis</kwd>
        <kwd>partitioning algorithms</kwd>
        <kwd>competitive learning</kwd>
        <kwd>self-organizing map</kwd>
        <kwd>growing neural network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In the common practice of various domains, e.g. pattern recognition (recognition
of objects in range data, letters), information retrieval (grouping documents,
looking for common topics) or data mining (searching for interesting patterns in
arbitrary data sets), a clustering technique can serve as a valuable tool.
The most widely used techniques are mainly k-means based methods (see
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), which are easy to use, and the obtained results are fairly
understandable. Nevertheless, these methods are too static in one particular point of view:
at the beginning we have to specify the number of expected clusters, and the
algorithm is then responsible for finding them in the data set. But what if we
cannot determine this number? There are two alternatives to solve such a problem.
• The first solution is to proceed in a trial-and-error manner, running the
algorithm repeatedly and modifying the clustering parameters until a satisfactory
partitioning is found.
• The second solution is an adaptation of the algorithm so that it is allowed
to split and/or merge clusters according to some predefined condition.
Usually, clusters are split if the intra-cluster variability increases over some
threshold and merged if the typical points of neighbouring clusters are near
enough. We can also ignore insufficiently numerous clusters.
      </p>
      <p>
        The second solution was used, for example, in algorithms like isodata (see
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) or class (see [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]).
      </p>
      <p>
        On the other hand, we can use a model of artificial neural network based on
competitive learning, known as the self-organizing map (som) or Kohonen map
(see [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] or [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]). A fundamental feature of som is that it preserves the data
topology. Thus, the neurons of the map are likely to occupy the places in the input
space where the more dense regions are situated. The basic model of som consists of
neurons whose quantity is specified in advance. That is, with this model we are
able to discover only a predefined number of clusters.
      </p>
      <p>
        Like in the previous case, som was also adapted to allow the topology to grow. One
of the usable approaches is the Growing neural gas (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]).
      </p>
      <p>In this article we first re-introduce the k-means algorithm in section 2 and one
of its derivatives (the class method), which allows adaptation of the number of clusters,
in section 3. In section 4 we focus on the fundamentals of som, followed by a brief
introduction of the Growing neural gas algorithm in section 5. In section 6 we
provide some experiments comparing the results obtained with
both types of methods.</p>
    </sec>
    <sec id="sec-1a">
      <title>k-means based methods</title>
      <p>So-called partitional clustering methods could be stated as follows: “given N patterns
in an n-dimensional space, determine the partition of the patterns into k groups
where patterns in the same group are more similar than patterns from different
groups.” The notion of pattern similarity has to be adopted in advance, and
different algorithms use different ones.</p>
      <p>Moreover, the issue of determining the appropriate k is not decidable in all
situations. If there exists some general perspective on the clustered data, we can use
a fixed value of k. The task of a partitional algorithm is then to locate the clusters
in the input space. If we are not able to determine k at the beginning, we could
proceed in a trial-and-error way, modifying the clustering parameters. However, there
are also methods trying to locate clusters in the input space and to determine
the number of such clusters at the same time.</p>
      <p>The algorithm generally known as k-means clustering is one where the
number of expected clusters has to be given as a clustering parameter. Given the
set of patterns, the algorithm then tries to locate k clusters.</p>
    </sec>
    <sec id="sec-2">
      <title>Choose typical points:</title>
      <p>Place k typical points of clusters, according to the chosen method, into the
multidimensional space containing the examined patterns.</p>
    </sec>
    <sec id="sec-3">
      <title>Clustering:</title>
      <p>Assign each pattern to exactly one typical point – to the nearest one.</p>
    </sec>
    <sec id="sec-4">
      <title>Recompute typical points:</title>
      <p>Using all patterns assigned to a particular cluster, recompute its typical point
as their mean value.</p>
    </sec>
    <sec id="sec-5">
      <title>Check termination condition:</title>
      <p>The computation ends when the termination condition is fulfilled. A typical
condition is that there are no (or only minimal) changes in cluster memberships.</p>
      <p>The problem of the initial placement of the k typical points could be solved using several
techniques. The simplest ones are choosing k random points or randomly choosing
k of the input points.</p>
      <p>Each pattern of the examined data set is then assigned to some typical point.
The appropriate one is determined as the one with the smallest Euclidean
distance.</p>
      <p>$$d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$</p>
      <p>When all patterns are processed, the new typical points are computed. The
new typical point of a cluster is determined as the mean vector of the patterns in
the group. The algorithm usually terminates when no change in cluster
membership occurs in two subsequent iterations.</p>
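      <p>To make the above procedure concrete, the following is a minimal sketch of the
described k-means algorithm in Python (the language, the NumPy dependency and all
identifiers are our illustrative choices, not part of the original formulation):</p>
      <preformat>
import numpy as np

def k_means(patterns, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Choose typical points: here, k randomly chosen input points.
    typical = patterns[rng.choice(len(patterns), k, replace=False)].copy()
    membership = None
    for _ in range(max_iter):
        # Clustering: assign each pattern to the nearest typical point,
        # using the Euclidean distance d_E defined above.
        dist = np.linalg.norm(patterns[:, None, :] - typical[None, :, :], axis=2)
        new_membership = dist.argmin(axis=1)
        # Check termination condition: no change in cluster membership.
        if membership is not None and np.array_equal(new_membership, membership):
            break
        membership = new_membership
        # Recompute typical points as the mean of the assigned patterns.
        for j in range(k):
            if np.any(membership == j):
                typical[j] = patterns[membership == j].mean(axis=0)
    return typical, membership
      </preformat>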
      <sec id="sec-5-1">
        <title>CLASS – where clusters may arise and disappear</title>
        <p>The clustering method class is inspired by the former isodata method, which, in
comparison with the k-means method, is able to refine the number of clusters during
the computation, but has some crucial disadvantages, mainly that many
parameters need to be set by the user. Such an approach usually leads to incorrect results
if the data are not understood properly. The class method, on the other hand, in
most cases determines the parameters from the input data.</p>
        <p>The class method proceeds in a few steps. At the beginning, the user has to
set three input parameters: the maximum number of iterations $\gamma$, the minimum
number of members in a cluster $\theta_N$, and the initial splitting threshold $s_0$.
For the testing purposes we used a modified version of the class method where it is
also possible to set the initial number of typical points K and to choose an arbitrary
method for placing them in the input space.</p>
        <p>The very first step of the clustering is assigning each pattern to exactly one of
the typical points. This proceeds using the k-means algorithm described above.
After this, the class method iterates through the three following steps:</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Excluding small clusters:</title>
      <p>All clusters with fewer than $\theta_N$ members which have not changed during the
last two iterations are excluded from the subsequent analysis.</p>
    </sec>
    <sec id="sec-7">
      <title>Splitting clusters:</title>
      <p>In the m-th iteration we have to compute the splitting threshold
$$s_m = s_{m-1} + \frac{1 - s_0}{\gamma}$$
In each cluster we need to compute the deviations $d_{ij}$ between the typical point
and each pattern belonging to this typical point. Then, for the j-th attribute
we can compute the average deviation $D_{j1}$ of the patterns situated on the right
and $D_{j2}$ of the patterns situated on the left side:</p>
      <p>$$D_{j1} = \frac{1}{k_1} \sum_{i=1}^{k_1} d_{ij}, \qquad D_{j2} = \frac{1}{k_2} \sum_{i=1}^{k_2} d_{ij}$$
where $k_1$ and $k_2$ are the numbers of points situated on the right and on the
left side of the typical point within the given attribute. Now we are ready to
determine the parameters controlling the partitioning of the clusters. For each
cluster we compute parameters $a_1$ and $a_2$ for the patterns on the right and
left side:</p>
      <p>$$a_1 = \max_j \frac{D_{j1}}{\max_i x_{ij}}, \qquad a_2 = \max_j \frac{D_{j2}}{\max_i x_{ij}}$$
where $j = 1, 2, \ldots, p$ and i runs over all points of the same cluster. If in the m-th
iteration the number of clusters is less than 2K, $a_1 &gt; s_m$ or $a_2 &gt; s_m$, and the
number of processed patterns is greater than $2(\theta_N + 1)$, then we split
the cluster with respect to the attribute j where $a_1$ or $a_2$ is maximal. The newly
created clusters contain the patterns from the right and left side of the typical
point within the j-th attribute.</p>
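      <p>As an illustration, the following Python sketch computes the splitting threshold and
the parameters $a_1$ and $a_2$ for a single cluster under our reading of the formulas
above; the function names and the NumPy representation are our own assumptions:</p>
      <preformat>
import numpy as np

def splitting_threshold(s_prev, s0, gamma):
    # s_m = s_{m-1} + (1 - s_0) / gamma
    return s_prev + (1.0 - s0) / gamma

def split_parameters(cluster, center):
    # cluster: (n_points, p) patterns of one cluster; center: its typical point.
    dev = cluster - center                    # signed deviations d_ij
    result = []
    for side in (dev &gt; 0, dev &lt; 0):           # right side, then left side
        k_side = np.maximum(side.sum(axis=0), 1)   # k_1 resp. k_2 per attribute
        d_avg = np.where(side, np.abs(dev), 0.0).sum(axis=0) / k_side
        # a = max over attributes j of D_j / max_i x_ij
        # (we take |x| to keep the denominator positive; an assumption).
        result.append(np.max(d_avg / np.abs(cluster).max(axis=0)))
    return result                             # [a_1, a_2]
      </preformat>
      <p>A cluster would then be split on the attribute where the larger of $a_1$, $a_2$ is
attained, provided the conditions on the number of clusters and processed patterns hold.</p>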
    </sec>
    <sec id="sec-8">
      <title>Revoking clusters:</title>
      <p>To decide which cluster has to be revoked we need to compute the average
minimum distance of clusters in advance:
$$\tau = \frac{1}{h} \sum_{i=1}^{h} D_i$$
where h is the current number of clusters and $D_i$ is the minimum distance of the
i-th typical point from the other ones. If $D_i &lt; \tau$ for some cluster i and the
number of clusters is greater than $K/2$, we revoke the i-th cluster. Patterns belonging
to the revoked cluster are dispersed to the nearest typical points.</p>
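      <p>A possible reading of this revocation test in Python (again a sketch; the array
layout and all names are our assumptions):</p>
      <preformat>
import numpy as np

def clusters_to_revoke(centers, K):
    # Pairwise distances between typical points; D_i is the minimum
    # distance of the i-th typical point from the other ones.
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    d_min = dist.min(axis=1)                  # D_i
    tau = d_min.mean()                        # average minimum distance
    revoke = []
    for i in np.argsort(d_min):
        # Revoke cluster i if D_i &lt; tau, while more than K/2 clusters remain.
        if d_min[i] &lt; tau and len(centers) - len(revoke) &gt; K / 2:
            revoke.append(i)
    return revoke
      </preformat>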
      <p>These steps proceed until the clusters remain unchanged or the maximum number
of iterations $\gamma$ is reached.</p>
      <sec id="sec-8-1">
        <title>Self-organizing map expresses the topology</title>
        <p>The self-organizing map (som) is an artificial neural network based on the principle
of competitive learning. The net consists of a set A of n neurons, represented
by weight vectors $w_i$. Furthermore, the neurons are mutually interconnected, and
these bindings form some topological grid (usually rectangular or triangular).
If we present a pattern x to this network, then exactly one neuron can be
the winner, and its weights are adapted proportionally to the pattern (the
neuron is moved closer to it). Moreover, the neurons from the neighbourhood of the winner
are adapted too, but not so intensively. The neighbourhood N(c) can be formally
defined as the set of neurons that are topologically near the winner.</p>
        <p>The winner of the competition is determined as the neuron with the minimum
distance to the pattern:
$$c = \arg\min_{a \in A} \|x - w_a\| \qquad (1)$$</p>
        <p>Then, the adaptation of the weights proceeds using equation (2), generally
known as Kohonen’s rule:
$$w_{ji}(t + 1) = \begin{cases} w_{ji}(t) + h_{cj}(t)\,(x_i(t) - w_{ji}(t)) &amp; j \in N(c) \\ w_{ji}(t) &amp; \text{otherwise} \end{cases} \qquad (2)$$</p>
        <p>The weight vectors of the winner and of the neurons in its neighbourhood are adapted
for the next iteration t + 1 so that the current weights are shifted towards the input
pattern by a fraction of the difference between the pattern and the current weights.</p>
        <p>The parameter $h_{cj}(t)$ is usually represented by a unimodal Gaussian function with
center in c, width $\sigma(t)$ and maximal unit movement $h_0(t)$:
$$h_{cj}(t) = h_0(t) \exp\left(-\frac{\|w_c - w_j\|^2}{2\sigma^2(t)}\right)$$
The values of $\sigma(t)$ and $h_0(t)$ decrease in time – this corresponds to rough learning in the
beginning and fine learning later.</p>
        <p>One of the features of som is its topology-preserving behaviour. This means
that som tries to adapt the weights of the neurons to cover the most dense regions, and
therefore som naturally finds data clusters. The limitation of som lies in the fact
that it is designed to have the number of neurons specified as an input parameter,
immutable during the learning process.</p>
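        <p>A single learning step of som can be sketched in Python as follows. Note that the
text above defines the neighbourhood function on the weight vectors $w_c$, $w_j$; many
implementations use the coordinates of the neurons on the topological grid instead.
The code follows the formula as reconstructed above; all names are illustrative:</p>
        <preformat>
import numpy as np

def som_step(weights, x, h0, sigma):
    # weights: (n_neurons, dim) array of weight vectors w_i.
    # Competition: the winner c minimises ||x - w_a||  (equation (1)).
    c = np.linalg.norm(x - weights, axis=1).argmin()
    # Neighbourhood h_cj = h0 * exp(-||w_c - w_j||^2 / (2 sigma^2)).
    d2 = ((weights - weights[c]) ** 2).sum(axis=1)
    h = h0 * np.exp(-d2 / (2.0 * sigma ** 2))
    # Kohonen's rule (equation (2)): move neurons towards the pattern.
    weights += h[:, None] * (x - weights)
    return c
        </preformat>
        <p>In a full training loop, $h_0$ and $\sigma$ would be decreased over time, giving the
rough-then-fine learning described above.</p>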
      </sec>
      <sec id="sec-8-2">
        <title>A self-organizing map that grows</title>
        <p>
          The Growing Neural Gas (gng) method [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] is a modification of the som where the
number of neurons is not an immutable input parameter but changes during the
competition. The connections between neurons are not permanent either. The result
of the competition can then be a set of separate neural networks, each covering some region
of the input data.
        </p>
        <p>In the beginning the network contains only two neurons, $a_1$ and $a_2$,
representing two randomly chosen input patterns. Denote the set of neurons as A
and the set of connections as C, the latter being empty in the beginning.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Competition</title>
      <p>The pattern x is presented to the network. The winner $s_1$ of the competition
and the second nearest neuron $s_2$ are determined using equation (1). If there
was no connection between the neurons $s_1$ and $s_2$, it is created
($C = C \cup \{(s_1, s_2)\}$). The age of the connection is set or updated to 0
($\mathrm{age}(s_1, s_2) = 0$). The squared distance between the winner and the
pattern is added to a local error variable:
$$\Delta E_{s_1} = \|x - w_{s_1}\|^2$$</p>
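      <p>A sketch of this competition step in Python, with the network held in plain
dictionaries and connections as two-element frozensets (a representation we assume for
illustration, not prescribed by the gng paper):</p>
      <preformat>
import numpy as np

def gng_competition(weights, edges, ages, error, x):
    # weights: {neuron: vector}; edges: set of frozenset pairs;
    # ages: {edge: int}; error: {neuron: accumulated local error}.
    dist = {a: np.linalg.norm(x - w) for a, w in weights.items()}
    s1, s2 = sorted(dist, key=dist.get)[:2]   # winner and second nearest
    edge = frozenset((s1, s2))
    edges.add(edge)                           # create the edge if missing
    ages[edge] = 0                            # set or reset its age to zero
    error[s1] += dist[s1] ** 2                # accumulate the squared distance
    return s1, s2
      </preformat>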
    </sec>
    <sec id="sec-10">
      <title>Adaptation</title>
      <p>The weight vectors of the winner and its direct topological neighbours<sup>1</sup> $N_{s_1}$
are adapted by fractions $\varepsilon_b$ and $\varepsilon_n$ of the distance to the input pattern. This
is analogous to Kohonen’s rule (equation (2)) described above:
$$\Delta w_{s_1} = \varepsilon_b (x - w_{s_1})$$
$$\Delta w_i = \varepsilon_n (x - w_i) \qquad \forall i \in N_{s_1}$$
The ages of all connections leading from the winner neuron are increased by 1
($\mathrm{age}(s_1, i) = \mathrm{age}(s_1, i) + 1$ for all $i \in N_{s_1}$).</p>
      <p><sup>1</sup> Note that neuron i is in the direct topological neighbourhood of the winner c if there
exists a connection (i, c) between these two neurons.</p>
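      <p>Continuing the sketch above (the fractions $\varepsilon_b$, $\varepsilon_n$ are free
parameters of the method; the default values below are merely illustrative):</p>
      <preformat>
def gng_adapt(weights, edges, ages, x, s1, eps_b=0.2, eps_n=0.006):
    # Move the winner by a fraction eps_b of its distance to the pattern.
    weights[s1] += eps_b * (x - weights[s1])
    for edge in [e for e in edges if s1 in e]:
        (i,) = edge - {s1}
        # Move each direct topological neighbour by the smaller fraction eps_n.
        weights[i] += eps_n * (x - weights[i])
        ages[edge] += 1     # age every connection emanating from the winner
      </preformat>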
    </sec>
    <sec id="sec-11">
      <title>Removing</title>
      <p>If there exist connections with age greater than a given $a_{max}$, all of them
are removed. If this step results in neurons with no connections, these
standalone neurons are removed as well.</p>
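      <p>The removal step in the same sketch (again, the container layout is our assumption):</p>
      <preformat>
def gng_remove(weights, edges, ages, error, a_max=50):
    # Remove all connections whose age exceeds a_max.
    for edge in [e for e in edges if ages[e] &gt; a_max]:
        edges.discard(edge)
        del ages[edge]
    # Remove neurons that are left without any connection.
    connected = set().union(*edges) if edges else set()
    for a in [a for a in weights if a not in connected]:
        del weights[a]
        del error[a]
      </preformat>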
    </sec>
    <sec id="sec-12">
      <title>Inserting new neurons</title>
      <p>If the number of processed patterns has reached an integer multiple of the given
parameter λ, then a new neuron is inserted using the following steps:
1. First of all, the neuron p with the largest accumulated local error is
determined:
$$p = \arg\max_{a \in A} E_a$$
Among the neighbours of neuron p, determine the neuron r with the largest
accumulated local error.
2. Insert a new neuron q into the network ($A = A \cup \{q\}$) and set its weight to
the mean value of the weight vectors of p and r:
$$w_q = \tfrac{1}{2} (w_p + w_r)$$
3. Insert new connections between the new neuron q and the neurons p and r
($C = C \cup \{(p, q), (r, q)\}$). Remove the old connection between the neurons p and r
($C = C \setminus \{(p, r)\}$).
4. The local accumulated error variables of the neurons p and r are decreased by
a given fraction α:
$$\Delta E_p = -\alpha E_p, \qquad \Delta E_r = -\alpha E_r$$
The accumulated error variable of the new neuron q is set to the mean
value of the accumulated error variables of p and r.
5. The local accumulated error variables of all neurons in the network are
decreased by a given fraction β:
$$\Delta E_a = -\beta E_a \qquad \forall a \in A$$</p>
      <p>These steps proceed until a pre-defined termination condition is met. This
could be some performance measure or, usually, the net size.</p>
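      <p>The insertion step can be sketched as follows (the values of $\alpha$ and $\beta$ and
all identifiers are illustrative assumptions, matching the sketches above):</p>
      <preformat>
def gng_insert(weights, edges, ages, error, alpha=0.5, beta=0.0005):
    # 1. Neuron p with the largest accumulated error, and its neighbour r
    #    with the largest accumulated error.
    p = max(error, key=error.get)
    neighbours = {i for e in edges if p in e for i in e - {p}}
    r = max(neighbours, key=error.get)
    # 2. Insert the new neuron q halfway between p and r.
    q = max(weights) + 1                      # assumes integer neuron ids
    weights[q] = 0.5 * (weights[p] + weights[r])
    # 3. Connect q with p and r; remove the old connection (p, r).
    old = frozenset((p, r))
    edges.discard(old)
    ages.pop(old, None)
    for e in (frozenset((p, q)), frozenset((r, q))):
        edges.add(e)
        ages[e] = 0
    # 4. Decrease the errors of p and r by fraction alpha; q starts with
    #    the mean of the decreased values.
    error[p] -= alpha * error[p]
    error[r] -= alpha * error[r]
    error[q] = 0.5 * (error[p] + error[r])
    # 5. Decrease the error of every neuron by fraction beta.
    for a in error:
        error[a] -= beta * error[a]
      </preformat>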
      <sec id="sec-12-1">
        <title>Examples and comparison</title>
        <p>As example data for the subsequent clustering we use data drawn from the
distribution depicted in figure (1(a)). The highlighted regions are those
where data are located, and the white regions contain no data. For the purposes of
clustering using the k-means method and the class method we have to generate the
points complying with the given distribution in advance. A set of 1000 points from the given
distribution is depicted in figure (1(b)). For use with the som and gng methods
we may use a continuous pseudo-random generator of patterns from the given
distribution<sup>2</sup>.</p>
        <p>[Figure 1: (a) the distribution; (b) objects drawn from the distribution]</p>
        <p><sup>2</sup> The same generator with the same seed was also used to obtain the set of discrete
patterns for “classical” clustering. Without loss of generality we may say that the
explored sets of patterns were practically the same.</p>
        <p>First of all we have tested whether the k-means algorithm produces a partitioning
similar to that of the som neural network. If the number of typical points and the number of
neurons are set to the actual number of clusters in the data, then both methods
place their representatives at the centers of the clusters. The results of clustering using
both methods with K = 4 are depicted in figure (2).</p>
        <p>[Figure 2: (a) k-means; (b) som]</p>
        <p>If the number of representatives differs only slightly from the actual
number of clusters, then the obtained results are hardly interpretable. If the number is
smaller, then some representatives try to cover more clusters. If the number
is greater, then the extra representatives join another representative, and together they
cover one cluster with some more dense areas. The latter case may be incorrectly
explained as the presence of more clusters. In fact, the representatives start to express the
topology of the clusters and thereby find more fine-grained clusters. Figure (3)
depicts the situation where K was set to the values 5 and 25.</p>
        <p>It is clear that som is able to discover clusters in the input space as well as
the k-means method.</p>
        <p>Now we can focus on comparing the methods where splitting and merging of
clusters is allowed. In the case of the neural net approach we may talk about net growth.
One of the basic disadvantages of som is its inability to adapt its size according
to the proportions in the data (like k-means). On the other hand, gng can either add
neurons where the local error is too large or remove neurons that have no relation
to the rest (no existing connection).</p>
        <p>The issue we are interested in at this time is how the class method and the
gng method proceed in clustering the given data. Both methods start
with two representatives. The class method allows adding more representatives
in one iteration, in contrast to gng, which adds neurons after a pre-defined
chunk of steps passes. So, we compared the obtained partitionings at moments when
both methods had an identical number of representatives.</p>
        <p>[Figure 3: (a) k-means with K = 5; (b) k-means with K = 25]</p>
        <p>The very first comparison was taken when both methods reached 4
representatives (the situation is depicted in figure (4)). As is predictable, the
representatives are placed near the centers of the clusters to cover them as a whole.</p>
        <p>[Figure 4: (a) class]</p>
        <p>More interesting results were reached when both methods started to cover the
clusters at a finer level. With 9 representatives we can distinguish slightly
different behaviour of the two methods. The dislocation of representatives obtained
with the class method can be seen as a dislocation of centers of circles with similar
perimeter in an effort to cover all patterns – spherical clusters. On the other
hand, gng proceeds with an extended goal. The gng algorithm covers
all patterns with the neurons, and moreover it covers the clusters’ topology with
the connections. We may see that in figure (5(b)) there are three groups of
interconnected neurons. We may interpret this situation for example as “there
are three clusters with topology given by the connections”, but it is not so
definitive. In fact, there are four clusters, but the gng method does not discover them
appropriately. Nevertheless, 9 representatives is still too few.</p>
        <p>Much more interesting results can be seen when the number of representatives
reaches 21. The dislocation of the representatives obtained by both methods
is very similar, and the representatives cover all clusters effectively enough – they
express the topology. But one more question remains – what is the
resulting partitioning? In the case of the class method we have a set of non-related
representatives. In 2-dimensional space we can use visualisation to decide,
but in multi-dimensional space the resulting partitioning is uncertain. The
problem lies in the fact that the number of clusters is not the only thing we want to know.
We need to know something more: how the clusters look and how to
formally express the category each cluster represents. Another big issue is the
correctness of the results obtained with these methods on multi-dimensional data.</p>
        <p>[Figure 6: (a) class]</p>
        <p>The connections between the neurons of gng are very helpful in this respect.
They inform us about which neuron belongs to which cluster. From figure
(6(b)) we may infer the existence of 4 clusters, and using the connections we can
also state some conclusions about their shape or topology. Note, however, that some
edges may remain in the network that connect two neighbouring, not so distant, clusters.
Thus, the results obtained with the gng method have to be evaluated carefully.</p>
      </sec>
      <sec id="sec-12-2">
        <title>Conclusions</title>
        <p>From the comparisons between the k-means method and the som neural network, and
between the class method and the gng neural network, we see that utilizing neural
networks (with competitive learning) is a good idea in the clustering domain. The
most important feature of such neural networks is their natural ability to find
dense areas in the input space. The extension of the basic som algorithm to
dynamically reflect relations in the data (the possibility of net growth) makes
neural networks even more interesting.</p>
        <p>The results obtained using both types of methods show that the neural networks
are able to give at least the same results. Moreover, the fact that som-like neural
networks tend to preserve the topology means that the results could be
even better. Nevertheless, as with any other method for knowledge
discovery, the results have to be interpreted very carefully to be correct.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Everitt</surname>
            ,
            <given-names>B. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landau</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Leese</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Cluster analysis</article-title>
          . Oxford University Press,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Fritzke</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <article-title>A growing neural gas network learns topologies</article-title>
          .
          <source>Advances in neural information processing systems</source>
          <volume>7</volume>
          (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Kamber</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Data mining: Concepts and techniques</article-title>
          . Morgan Kaufmann Publishers,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>A. K.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Dubes</surname>
            ,
            <given-names>R. C.</given-names>
          </string-name>
          <article-title>Algorithms for clustering data</article-title>
          .
          <source>Advanced reference series</source>
          . Prentice hall,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>A. K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murty</surname>
            ,
            <given-names>M. N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Flynn</surname>
            ,
            <given-names>P. J.</given-names>
          </string-name>
          <article-title>Data clustering: A review</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>31</volume>
          ,
          <issue>3</issue>
          (
          <year>September 1999</year>
          ),
          <fpage>264</fpage>
          -
          <lpage>323</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kohonen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <source>Self-organizing maps</source>
          . Springer Verlag,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lukasová</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Šarmanová</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <source>Metody shlukové analýzy</source>
          . SNTL
          ,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Šíma</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Neruda</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <source>Teoretické otázky neuronových sítí</source>
          . Matfyzpress,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>