<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Spherical Randomized Gravitational Clustering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonatan Gomez</string-name>
          <email>jgomezpe@unal.edu.co</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elizabeth Leon</string-name>
          <email>eleonguz@unal.edu.co</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ALIFE Research Group</institution>
          ,
          <addr-line>Computer Systems</addr-line>
          ,
          <institution>Universidad Nacional de Colombia</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>MIDAS Research Group</institution>
          ,
          <addr-line>Computer Systems</addr-line>
          ,
          <institution>Universidad Nacional de Colombia</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Circular data, i.e., data in the form of 'natural' directions or angles are very common in a number of dierent areas such as biological, meteorological, geological, and political sciences. Clustering circular data is not an easy task due to the circular geometry of the data space. Some clustering approaches, such as the spherical k-means, use the cosine distance instead of the euclidean distance in order to measure the dierence between points. In this paper, we propose a variation of the randomized gravitational clustering algorithm in order to deal with circular data. Basically, we use the cosine distance, we modify the gravitational law in order to use the cosine distance and we use geodesics ('straight' lines in curved spaces) in order to move points according to the gravitational dynamic. Our initial experiments indicate that the spherical gravitational clustering algorithm is able to nd clusters in noisy circular data.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Circular data, i.e., data in the form of ’natural’ directions or angles, are
observations taken on compact Riemannian manifolds (curved spaces following
a Riemannian geometry) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Circular data are seen in dierent scientic areas:
wave directions in oceanography, directions of animal movement in biology, wind
directions in meteorology, rock fracture orientations in geology, periodic time in
economy and conformational angles obtained from the 3D coordinates of the
backbone chain of a protein in bio-informatics [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1,2,3</xref>
        ]. Modeling circular data (in
particular using machine learning or statistical approaches) is not an easy task
due to the curved geometry of the data space (Riemannian geometry). In
particular, some approaches in the directional statistics eld (the eld of statistics
that deals with circular data) consider the circular data as data drawn from a
set of distributions of von Mises-Fisher (vMF) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Some clustering approaches
(approaches that try to nd groups of similar points) in the data mining and
machine learning elds replace the euclidean distance with the cosine distance
(angular distance between points) in order to measure the dierence between
points in the Riemannian space. Such is the case of the spherical k-means [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Gomez et al. in [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ] proposed a clustering algorithm (the randomized
gravitational clustering algorithm, Rgc), which is robust to noise and unsupervised
in the number of cluster, based on concepts of eld theory in physics . Basically,
the gravitational dynamics is determined by moving points (in an euclidean
space) according to the gravitational eld generated by other point randomly
selected. In this paper, we propose a variation of such randomized gravitational
clustering algorithm in order to deal with circular data. In this way, we replace
the euclidean distance with the cosine distance, we modify the gravitational law
in order to use the cosine distance and we use geodesics, the ’straight’ lines in
curved spaces, in order to move points according to the gravitational dynamic.
This paper is divided in 5 Sections. Section 2 sumarizes the Rgc algorithm
proposed by Gomez et al in [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ]. Section 3 explains the changes introduced to the
Rgc algorithm in order to deal with circular datasets using a cosine distance.
Section 4 analyzes preliminar results obtained with Rgc. Finally, Section 5 outlines
some conclusions.
2
      </p>
      <p>
        Randomized Gravitational Clustering ( Rgc)
Gravitational clustering ( Gc) algorithms are considered agglomerative
hierarchical algorithms based on concepts of eld theory in physics [
        <xref ref-type="bibr" rid="ref8 ref9">8,9</xref>
        ]. The Gc
algorithm simulates the gravitational system obtained of considering data points
as initial particles with mass equal one in a space exposed to gravitational elds.
Gomez et al. in [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ] proposed a Gc algorithm which is robust to noise and
unsupervised in the number of clusters (see Algorithm 1). Basically, points do
not change in mass and are not removed during the simulation of the system
dynamic. Two points are merged into virtual clusters using a union-disjoint set
structure [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] (lines 1-2, 7 and 10, i.e., functions Make, Find and Union), when
they are close enough (line 7). Gomez et al. estimated the greatest minimal
separation (d^) between N uniformly separated points 3, (here N is the size of the
data set), using the 2-dimensional hexagonal packing of circles approach [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and
used it as re-normalization factor to reduce the eect of the data set size in the
system dynamic. Moreover, instead of considering all points to move a point,
just another point is randomly selected and both points are moved (line 6)
ac0 13
d
b
dx!;y
      </p>
      <p>
        A )
is the vector (’straight’ line between points y and x). These equations are the
vectorial representation of the Newton Gravitational and Second Laws[
        <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
        ].
The big crunch eect (one single big cluster at the end) is eliminated by
introducing a cooling mechanism similar to the one used in simulated annealing (line
3 The problem of determining the optimal arrangement of points in such a way that
the greatest minimal separation between points is obtained, is an open problem in
Geometry [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
0
d
b
dx!;y
13
A ), here dx!;y
Algorithm 1 Randomized Gravitational Clustering
Rgc( x, G, 4(G), M, ")
x: Data set, G: Gravity strength, 4(G): Cooling factor, M: Iterations, ": Fusion distance
1. for i=1 to N do //Creates the Union-Disjoint cluster set
2. Make(i) //Each data points is initially a cluster
3. for i=1 to M do
4. for j=1 to N do
5. k = random point index such that k 6= j
6. Move( xj , xk ) //Move both points using gravity motion equation.
7. if dxj;xk " then Union( j, k) //Merges (virtually) to clusters
8. G = (1-4(G))*G //Reduces the gravity strength
9. for i=1 to N do //Canonical Union-Disjoint clusters set
10. Find(i)
11. return disjoint-sets
8). When the simulation is terminated, clusters are extracted if they have a
minimum number of points ( ). Finally, Gomez et al. determine an appropriated
value of G by using an extended bisection search algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]: the number
of clustered points qM (points that were assigned to some cluster with two or
more points), after some checking iterations of the Rgc algorithm M , is used as
indicator of the quality G, by comparing it against an expected value Q .
3
      </p>
      <p>Spherical Randomized Gravitational Clustering ( Sgc)
Two elements should be considered to apply the Rgc algorithm to circular data:
(i ) the system dynamic (gravitational eld denition and movement of points
in hyper-sphere surface) and ( ii ) the greatest minimal separation between
’uniformly’ separated points in the surface of the hyper-sphere.
3.1</p>
      <p>
        Spherical Gravity Law (Gravity Law in the n sphere surface)
Although classic Newton Gravitational Law is dened for Euclidean spaces
(spaces described by Euclidean geometry), it has been generalized to Curved
spaces (spaces described by Riemannian geometry) such as the surface of an
hyper sphere (in short n sphere surface) [
        <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
        ]. In such curved spaces (the
n sphere surface), a ’straight’ line becomes an arc (segment of circle) and is
called geodesic. Since there is a one to one correspondence between the chord
(the Euclidean straight line between points in a n sphere surface) and the
geodesic between them, and any Curved space behaves locally as an Euclidean
space (it is a manifold), it is possible to approximate the motion of a point due
to Gravitational Law and Newton motion law by using a cosine distance, the
Euclidean straight line vector and projecting the obtained point to the n sphere
surface (renormalizing). In this way, the moving function of a point y due to the
gravitational eld of other point x (line 6 in Algorithm 1), is computed using
Equation 1.
d
b
cdx;y
      </p>
      <p>A
where, cdx;y is the cosine distance between points x and y, x
Euclidean straight line between points x and y, and normalize ( !z) =
(k !zk is the Euclidean norm of vector !z).
3.2</p>
      <p>Greatest Minimal Separation between Points ( d^)
We analize the behavior of d^ in the 2 sphere surface (circle in two dimensions) to
nd a better estimation of d^ when dealing with a n sphere surface. Notice that,
the maximum distance between closest points is the cosine distance dened by
the angle 2N . Similar behavior is observed when working in a 3 sphere surface,
where, it is close to p2N . In this way, a rough approximation 4 of the angle dened
by two closest points (of a data set of N points) in a n sphere surface is provided
by n 2p1N . Therefore, an estimation of d^ in circular data is given by Equation 2.
(1)
!y is the
!</p>
      <p>Z
k !zk
db = k cd</p>
      <p>2
n p1N
(2)</p>
      <p>Here N is the number of data points in the n-sphere surface, cd is the cosine
distance and k is a correction factor due to high dimensionality of the data set
(in our case we set it to 2).
4</p>
    </sec>
    <sec id="sec-2">
      <title>Experiments</title>
      <p>
        We use the Sgc algorithm for nding clusters in dierent real and synthetic
circular data sets. Due to the lack of space, we show (as proof of concept) the
results obtained by Sgc on two 3D synthetic data sets with six clusters, each
cluster following a vMF distribution with dierent concentration parameter and
number of samples. The rst data set is free of noise while the second one contains
20% of noise. In this paper, we xed Q = j p2N k, M = 1 and = p2Q for
estimating the value of G, since these values are good approximations to the
ones proposed by Gomez et al in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and reduce the time complexity of this
estimation algorithm to lineal O (N ) respect to the number of data points.
      </p>
      <p>Figure 1 shows the evolution of the Srg on the clean synthetic data set
after 25, 50 and 100 iterations. When noisy points are not part of the data set,
while points are moved to their clusters centers (lower row Figure 1), clusters are
formed quickly and points are not asigned to incorrect clusters (upper row Figure
1). The algorithm stops at iteration 119 when no more clusters can be formed
due to the cooling factor. Clearly, the Sgc algorithm is able to nd cluster in
clean circular data sets.
4 By using the manifold property of the hyper-sphere surface, i.e. considering local
vecinities of the hyper-sphere surface as hyper-cubes.
Mining circular data (data in the form of ’natural’ directions or angles) is a
challenging task due to the curve geometry of the data space. However, it is
possible to accomplish this task by considering the clustering process as the
result of a dynamic system, in particular, the result of a gravitational dynamic
system. In this direction, we were able to generalize the Newton gravitational
law and Newton Second Motion law to curved space (like the surface of an
hypersphere) in order to use the cosine distance instead of the euclidean distance for
simulating such dynamic system in a curved space. Our results indicate that the
Spherical Gravitational Clustering algorithm is able to nd clusters in curved
spaces and in the presence of noise. Our future work will concentrate in using
the Sgc on analysis of protein structure and in higher dimensional spaces.
1
0.5
0
−0.5
−1
1 1 0.5
0.5
1
0.5
0
−0.5
−1
1 1 0.5
0 −0.5 −1 −1
1
0.8
0.6
0.4
0.2
0
−0.2
−0.4
−0.6
−0.8
1 0.5
1
0.5
0
−0.5
−1
1 0.5
0 −0.5 −1 −1</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Space and Space-Time Modeling of Directional Data</article-title>
          .
          <source>PhD thesis</source>
          , Duke University,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>K.</given-names>
            <surname>Mardia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Taylor</surname>
          </string-name>
          , and G. Subramaniam,
          <article-title>Protein bioinformatics and mixtures of bivariate von mises distributions for angular data</article-title>
          ,
          <source>Biometrics</source>
          , vol.
          <volume>63</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>505512</fpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>I. S.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Sra</surname>
          </string-name>
          ,
          <article-title>Modeling data using directional distributions, tech</article-title>
          . rep., The University of Texas at Austin,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>K.</given-names>
            <surname>Mardia</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Jupp</surname>
          </string-name>
          , Directional Statistics . Jhon Wiley and Sons,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>K.</given-names>
            <surname>Hornik</surname>
          </string-name>
          , I. Feinerer,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kober</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Buchta</surname>
          </string-name>
          ,
          <article-title>Spherical k-means clustering</article-title>
          ,
          <source>Journal of Statistical Software</source>
          , vol.
          <volume>50</volume>
          , pp.
          <volume>122</volume>
          , 9
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          ,
          <article-title>A new gravitational clustering algorithm</article-title>
          ,
          <source>in Proceedings of the Third SIAM International Conference on Data Mining</source>
          <year>2003</year>
          , pp.
          <fpage>8394</fpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>J.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nasraoui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Leon</surname>
          </string-name>
          ,
          <article-title>Rain: Data clustering using randomized interactions between data points</article-title>
          ,
          <source>in Proceedings of the Third International Conference on Machine Learning and Applications (ICMLA</source>
          <year>2004</year>
          ) , pp.
          <fpage>250255</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Wright</surname>
          </string-name>
          , Gravitational clustering,
          <source>Pattern Recognition, no. 9</source>
          , pp.
          <fpage>151166</fpage>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>S.</given-names>
            <surname>Kundu</surname>
          </string-name>
          ,
          <article-title>Gravitational clustering: a new approach based on the spatial distribution of the points</article-title>
          ,
          <source>Pattern Recognition, no. 32</source>
          , pp.
          <fpage>11491160</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>T.</given-names>
            <surname>Cormer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leiserson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Rivest</surname>
          </string-name>
          , Introduction to Algorithms .
          <source>McGraw Hill.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. H. T. Croft,
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Falconer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Guy</surname>
          </string-name>
          , Unsolved Problems in Geometry . New York: Springer-Verlag,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. H. Steinhaus, Mathematical Snapshots, 3rd ed. New York: Dover,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>M. C. Hazewinkel</surname>
          </string-name>
          ,
          <source>Theory of Gravitation</source>
          . Springer,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. T. Padmanabhan,
          <source>Gravitation: foundations and Frontier</source>
          . Cambridge University Press,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>