=Paper= {{Paper |id=Vol-1743/paper14 |storemode=property |title=Social Spider Algorithm Approach for Clustering |pdfUrl=https://ceur-ws.org/Vol-1743/paper14.pdf |volume=Vol-1743 |authors=Harley Vera-Olivera,José Luis Soncco-Alvarez,Lauro Enciso-Rodas |dblpUrl=https://dblp.org/rec/conf/simbig/OliveraSE16 }} ==Social Spider Algorithm Approach for Clustering== https://ceur-ws.org/Vol-1743/paper14.pdf
                    Social Spider Algorithm Approach for Clustering


      Harley Vera-Olivera*                José Luis Soncco-Álvarez*                Lauro Enciso-Rodas*
     harleyve@gmail.com, jose.soncco.alvarez@gmail.com, lauro.enciso@unsaac.edu.pe
                          National University of San Antonio Abad del Cusco*




                       Abstract                                 of these problems there is little prior information,
                                                                such as statistical models. It is under these restric-
    Clustering is a popular data analysis tech-                 tions that clustering is particularly appropriate for
    nique to identify homogeneous groups of                     the exploration of interrelationships between data
    objects based on the values of their at-                    to make a preliminary evaluation of its structure
    tributes, used in many disciplines and                      (Jain et al., 1999). Thus new conditions imposed
    applications. This extended abstract of                     by Big Data presented new challenges at different
    our undergraduate thesis for obtaining the                  levels including clustering.
    engineer degree in informatics and sys-                        The term clustering is used in several commu-
    tems, presents an approach based on the                     nities to describe methods for grouping of un-
    Social Spider Optimization (SSO) algo-                      labeled data (Jain et al., 1999). Clustering is
    rithm for optimizing clusters of data, tak-                 the task of discovering groups and data structures
    ing as metric the sum of euclidean dis-                     that are in some way or another ”similar”, with-
    tances. Other important algorithms of the                   out using known structures (vijayalakshmi and
    literature were implemented in order to                     Renuka Devi, 2012). Intuitively, patterns within a
    make comparisons: K-means algorithm,                        group are more similar compared to those patterns
    and a Genetic Algorithm (GA) for Clus-                      belonging to a different group. Here, the goal is
    tering. Experiments were performed us-                      to develop an automatic algorithm that can accu-
    ing 5 datasets taken from the UCI Ma-                       rately classify an unlabeled dataset in groups.
    chine Learning Repository, each algorithm                      Recent literature classifies clustering algorithms
    was executed many times and then the fol-                   in hierarchical, partitioning, and overlapping (Xu
    lowing measures were calculated: mean,                      and Wunsch, 2009). The partitional algorithm di-
    median, minimum, and maximum values                         vides a dataset into a finite number based on cer-
    of the results. These experiments showed                    tain criteria known as a measure of fitness. The fit-
    that the SSO algorithm outperforms the                      ness measure affects directly the natural formation
    K- means algorithm, and it has results                      of the groups, once a measure is selected the task
    equally competitive as the GA. All these                    of the partition becomes an optimization problem.
    results were confirmed by statistical tests
                                                                   K-means algorithm is the most fundamental
    performed over the outputs of the algo-
                                                                concept of partitional grouping, was published in
    rithm.
                                                                1957 by Lloyd (Lloyd, 1982). In this case the
1   Introduction                                                minimization of the Euclidean distance between
                                                                its elements and the center of a cluster was con-
Clustering is useful in several analysis such as ex-            sidered as a criterion of optimization. Inspired by
ploration of patterns, machine learning including               K-means many algorithms were developed such
data mining, documents retrieval, image segmen-                 as: Bisecting K-means (Steinbach et al., 2000),
tation and pattern classification. However in many              sort-means (Phillips, 2002), X-means (Pelleg and
    This work was started when first and second authors         Moore, 1999), among others.
were undergraduate students at the National University of          Recent studies reveal a new trend, which was
San Antonio Abad del Cusco and was finished when the au-
thors were graduate students at the University of Brasilia,     named as stochastic algorithms with randomized
both receiving a CAPES scholarship.                             and local search meta-heuristic. The random pro-



                                                          114
cess generates arbitrary solutions that explore the        for the clustering problem are presented, also the
search space and are responsible for achieving             pseudo-code of the algorithms is presented; in
global solution (Nanda and Panda, 2014). The               Section V the experiments and results are showed,
first meta-heuristic inspired by nature was the ge-        a discussion of this results is presented, and also a
netic algorithm developed by Holland and his col-          statistical analysis is performed; finally in Section
leagues in 1975 (Holland, 1975). This algorithm            VI are presented the conclusions and future work.
is classified as evolutionary algorithm. On the
other hand, new bio-inspired optimization algo-            2       The Clustering Problem
rithms are being introduced, such is the case of the
                                                           According to Mirkin (Mirkin, 2012), clustering is
algorithm inspired by the social behavior of spi-
                                                           a discipline dedicated to reveal and describe the
ders (Cuevas et al., 2013) classified as swarm in-
                                                           structures of groups in a dataset and may define
telligence algorithm proposed in 2013, which had
                                                           four important involved concepts: data, structure
not been applied to the clustering problem until
                                                           groups, reveal a group structure, and describe a
our proposal.
                                                           group structure. The following definitions were
   This extended abstract of our undergraduate             taken from (Maulik and Bandyopadhyay, 2000;
thesis for obtaining the engineer degree in in-            De Falco et al., 2007; Karaboga and Ozturk, 2011;
formatics and systems (Vera-Olivera and Soncco-            Senthilnath et al., 2011).
Álvarez, 2016), presents an approach based on the            Suppose S = {x1 , x2 , . . . xn } is a set of N -
SSO algorithm for the clustering problem. The              dimensional n points and C = {c1 , c2 , . . . , ck } is
contribution of this work is to show that the SSO          a set of N -dimensional k elements. The clustering
algorithm can produce competitive results regard-          problem in a N -dimensional space RN consists in
ing classic approaches such as: (a) the k-means al-        partitioning the set S in a number k of clusters
gorithm, which was implemented as presented in             based on a similarity metric, where each cluster
(Maulik and Bandyopadhyay, 2000); and (b) a ge-            has as center an element ci from C.
netic algorithm approach for the clustering prob-             Suppose that Gi , i = 1, . . . , k, represents a
lem, which was proposed by Maulik and Bandy-               cluster, then the following properties hold:
opadhyay (2000). The metric used for the com-
parisons is the sum of euclidean distances of the              • Gi 6= , to i = 1, . . . , k;
elements of the clusters to their respective cen-
ter, this metric is the output of the algorithms.              • Gi \ G j =       , to i, j = 1, . . . , k, such that
For the experiments were used 5 datasets from                    i 6= j;
the UCI Machine Learning Repository, for each of
                                                                    S
                                                                    k
these datasets the algorithms were executed sev-               •         Gi = S
eral times, and then the following measures were                   i=1
calculated: mean, median, minimum, and max-                   The clustering metric that has been adopted in
imum values. This experiment showed that the               this work is the sum of the Euclidean distances of
SSO algorithm has better results compared to the           the points of a group to their respective center. The
ones obtained by the K-means algorithm, also the           definition of this clustering metric M for k clus-
SSO algorithm has equally competitive results as           ters G1 , G2 , . . . , Gk , is given by the following ex-
the GA. Additionally, a statistical analysis was           pression:
performed, since we are working with stochastic
algorithms, using the Kolmogorov-Smirnov test
and the Wilcoxon rank sum test as discussed in                                              k X
                                                                                            X
(Demšar, 2006), (Durillo et al., 2009), (Muñoz et            M(G1 , G2 , . . . , Gk ) =                kxj   ci k
al., 2011). The results of these statistical test con-                                      i=1 xj 2Gi
firmed the results of the experiments.
                                                           3       Algorithm Based on the Social
   This paper is organized as follow: in Section                   Behavior of Spiders
II, are given some definitions related to the clus-
tering problem; in Section III is given the original       Cuevas et al. (2013) proposed a new optimiza-
proposal of the SSO algorithm; in Section IV, de-          tion algorithm, called Social Spider Optimiza-
tails of our approach based on the SSO algorithm           tion(SSO), the development of this new algorithm



                                                     115
was guided by the operational principle of the so-         3.2     Modeling of Vibrations Through the
cial behavior of spiders. The SSO algorithm as-                    Community Network
sumes that the solution space is a community net-          The community network is used as a mechanism
work (spider web), where spiders interact to each          for transmitting information between the members
others. The main features of this approach are:            of the colony. This information is coded as small
                                                           vibrations that are critical for collective coordina-
  • Each solution within a space of solutions rep-
                                                           tion of all individuals. The vibrations are based on
    resents the position of a spider in the commu-
                                                           the weight and the distance of the spider that gen-
    nity network.
                                                           erated it. The vibrations that are perceived by an
  • Each spider receives a weight according to             individual i as a result of information transmitted
    the value of fitness solution that represents.         by an individual j are modeled by the following
                                                           expression:
  • The algorithm modeled two types of search                                                  2
    agents (spiders): male and female. Depend-                             V ib i,j = wj ⇤ e di,j
    ing on the genre each individual performs dif-
                                                             Where di,j is the euclidean distance between
    ferent types of operations that simulate their
                                                           spiders i and j. There are three special types of vi-
    social behavior within the colony.
                                                           brations that are considered in the SSO algorithm:
   An important feature of the colonies of social
                                                             • V ib i,c vibrations, where c is the closest
spiders is that they have a high number of female
                                                               member to i that has a higher weight com-
agents. This fact is simulated by defining the num-
                                                               pared to i (wc > wi ).
ber of females Nf randomly within the range of 65
to 90% of N , which is the number of elements of             • V ib i,b vibrations, where b is the individual
the total population. The number of males Nm is                who has the best weight (best fitness value)
calculated as the complement of Nf regarding N .               of the whole population S.
   The total population S is divided into two sub-
groups F and M . The group F is the set of female            • V ib i,f vibrations, where f is the female in-
spiders, and the group M is the set of male spiders.           dividual closest to i.

              F = {f1 , f2 , . . . , fNf }                 3.3     Initialization of Population
                                                           The SSO algorithm starts by initializing the set S,
            M = {m1 , m2 , . . . , mNm }                   which contains N spiders positions. Each posi-
where S = F [ M = {s1 , s2 , . . . , sN }                  tion fi or mi , is an n-dimensional vector contain-
                                                           ing the values to be optimized. These values are
3.1 Calculation of Fitness                                 distributed uniformly between the values, plow and
Each individual (spider) i of the population S re-         phigh , which are previously specified.
ceives a wi weight, that represents the quality of its     3.4     Cooperative Operators
solution. This weight can be calculated as follows:
                                                           3.4.1    Cooperative Operator for female
                   J(si )       worsts                              spiders
              wi =
                   bests        worsts                     To emulate the cooperative behavior of the female
                                                           spiders, a new operator is defined. The operator
where J(si ) is the fitness value calculated by eval-      considers the change in position of a female spider
uating the position of a spider si regarding the           i at each iteration, this change can be attractive
function J. The values worsts and bests consid-            or repulsive and is calculated by combining three
ering a maximization problem, are defined as fol-          elements:
lows:
                                                             • The first element considers the change re-
    bests = max(J(sk )), k 2 {1, 2, . . . , N }                garding the nearest member to i that has
                                                               the highest weight and produces vibration
   worsts = min(J(sk )), k 2 {1, 2, . . . , N }                V ib i,c ;



                                                     116
  • The second element considers the change re-           defines a probability of influence on the new off-
    garding the best individual of the population         spring. The probability of influence Psi is as-
    S that produces vibration V ib i,b ;                  signed using the roulette-wheel selection, which
                                                          is defined as follows:
  • The third incorporates a random movement.
                                                                                       wi
                                                                           Ps i = P
  The last three elements can be considered as one                                    j2T g   wj
movement, we use the ”+” symbol for attraction
and the ”-” symbol for repulsion. The change in           where si 2 T g .
position can be calculated as follows:                       A spider is a solution within the solution space,
                                                          so a new spider is formed by choosing values for
            fik+1 = fik ± movement                        each variable, this variable is chosen within the
                                                          values defined by the method of roulette. For ex-
where k represents the iteration number.                  ample let snew = {v1 , v2 , . . . , vn } be the new
3.4.2 Cooperative Operator for male spiders               spider, each variable vi is determined using the
To emulate the cooperative behavior of the male           method of roulette-wheel selection.
spiders, these are divided into two groups: dom-             Once a new spider snew was formed is com-
inant D and non-dominant N D. This division               pared with the worst spider sworst from the colony
is made according to its position respect to the          according to their weights, where wworst =
median of all male individuals. Individuals who           minl2{1,2,...,N } (wl ). If the new spider snew is bet-
have a weight that is above the median are con-           ter than the worst spider sworst , then sworst is re-
sidered dominants, otherwise they are considered          placed by snew . Otherwise, the new spider is dis-
non-dominant.                                             carded and the colony does not suffer alterations.
   For dominant males are defined two move-               If a replacement occur, the new spider takes the
ments: (a) a movement of attraction to the near-          genre and index from the spider replaced.
est female f that produces a vibration V ib i,f , and
                                                          4     Optimization Algorithm Based on
(b) a random movement. The last two movements
                                                                Social Behaviour Spiders for
can be considered as one, and then the change in a
                                                                Clustering Problems
dominant male can be calculated as follows:
                                                          As proposal we present an SSO (Cuevas et al.,
          mk+1
           i   = mki + D movement                         2013) approach to solve the clustering problem.
                                                          This optimization algorithm based on the social
where k represents the iteration number.
                                                          behaviour of spiders is a meta-heuristic algorithm
  For non-dominant males is defined just one
                                                          of general purpose, so it is necessary to modify
movement of attraction to the weighted average of
                                                          many elements of the algorithms such as the rep-
male spiders. Then the change in a non-dominant
                                                          resentation of the individuals, calculation of the
can be calculated as follows:
                                                          fitness function, etc. Below are presented the el-
         mk+1 = mki + ND movement                         ements on which it was necessary to make mod-
          i
                                                          ifications to the original algorithm proposed in
where k represents the iteration number.                  (Cuevas et al., 2013).

3.5 Mating operator                                       4.1    Representations of Spiders (Individuals)
Mating a colony of spiders is made between fe-            The first consideration to take into account is the
males and dominant males. So when a dominant              representation of each spider. Each spider (male
male mg finds a set of female spiders E g within          or female) represents a set of k clusters centers,
a range of mating r, it mates, forming a new off-         which is a feasible solution to the problem of clus-
spring Snew . This new offspring is generated from        tering.
the set T g , which is formed by the union of E g and        For      instance,        let      x           =
mg . When the set T g is empty, mating operation          {(10.5; 20.4), (15.2; 25.0)} be a spider
is canceled.                                              that contains k = 2 cluster centers that are
   The weight of each spider that is involved in          {(10.5; 20.4) and (15.2; 25.0)}, in this particular
the mating process, i.e. spiders from the set T g ,       case each center has dimension n = 2.



                                                    117
   Each spider of the initial population was gen-              Algorithm 1: Algorithm for calculating the
erated taking k random points of a given dataset,              fitness of a spider
where k is the number of cluster to be found.                    Input: An array of cluster centers C (spider
4.2 Distance between Two Spiders                                         C); a set D of n-dimensional m
                                                                         points; an integer k > 0 that
It is necessary to define the distance between two                       represents the number of clusters
spiders, since a spider is formed by a set of cluster            Output: Metric M of spider C
centers (each center formed by several points) and           1 Create the set of empty clusters
not by a set of points. So we define the distance                G = {G1 , G2 , . . . , Gk }
between two spiders as the sum of the euclidean              2 foreach point x of the set D do
distances between their centers of clusters.                 3       Assign the point x to the cluster Gi whose
   For instance, let a = {(ax1 ; ay1 ), (ax2 ; ay2 )}                center Ci is the nearest to x;
and b = {(bx1 ; by1 ), (bx2 ; by2 )} be two spiders
that have k = 2 clusters centers, with each center           4 foreach cluster Gi do
having dimension 2. Then the distance between                5     calculate a new center Ci⇤ ;
these two spiders will be:                                   6   Calculate the metric M for the set of clusters
                                                                 G as defined in Section 2;
      da,b = d((ax1 ; ay1 ), (bx1 ; by1 )) +
              d((ax2 ; ay2 ), (bx2 ; by2 ))
                                                             set T . In order to define the spider from which the
where d((ax1 ; ay1 ), (bx1 ; by1 )) is the Euclidean         new spider will inherit a cluster center, it is used
distance between the centers (ax1 ; ay1 ) and                the roulette-wheel selection.
(bx1 ; by1 ).
                                                             4.5    Substitution of Spiders
4.3 Fitness and Weight of a Spider
                                                             In order to decide which spiders will be replaced
The fitness of each spider, which is an indicator            by the new spiders produced in the mating stage,
of how good is the solution that this spider repre-          also is used the roulette wheel selection method,
sents, is calculated using the metric M. The aim             where spiders of the population with less weight
of the SSO algorithm is to minimize the fitness of           (greater fitness) have more probability to be re-
the population. Thus, a spider that has the mini-            placed. It is important to note that the weight of
mum fitness is the best within the population.               a spider have a negative correlation with respect
   The pseudocode for calculating the fitness of a           to its fitness value, since we are working with a
spider is presented in Algorithm 1. The weight of a          minimization problem and not with a maximiza-
spider i was re-defined, because fitness and weight          tion problem as originally proposed by (Cuevas et
have negative correlation, and it is calculated in           al., 2013).
the following way:                                              The pseudocode of our proposal is showed in
                       worsts       J(si )                   Algorithm 2.
               wi =
                       worsts       bests                    5     Experiments and Results
       bests = min( J(sk ) ), k 2 {1, 2, . . . , N }
                                                             To compare the algorithms were taken five dataset
      worsts = max( J(sk ) ), k 2 {1, 2, . . . , N }
                                                             from UCI (UCI Machine Learning Repository)
                                                             repository: Balance, Cancer-Int, Dermatology, Di-
  where J(si ) is the fitness value of the spider i          abetes, Iris.
that was calculated using Algorithm 1.                           The Balance dataset was generated to model
                                                             psychological experiments, each example is clas-
4.4 Mating of Spiders                                        sified as having the balance scale tip to the right,
In the mating stage was defined a mating set T               tip to the left, or be balanced. The attributes are the
which is formed by a dominant male spider and the            left weight, the left distance, the right weight, and
female spiders that are within its range of mating.          the right distance. The correct way to find the class
From this set T are created new spiders, a new spi-          is the greater of (left-distance * left-weight) and
der represents a set of cluster centers, where each          (right-distance * right-weight). If they are equal,
cluster center is inherited from a spider within the         it is balanced.



                                                       118
   Algorithm 2: Social Spider Optimization al-                        Table 1: Properties of datasets
   gorithm for the clustering problem                             Dataset      Size Attributes Classes
    Input: A dataset D of n-dimensional m
                                                                Balance         625           4                3
            points; an integer k > 0 that
                                                               Cancer-Int       699           9                2
            represents the number of clusters
                                                              Dermatology       366           34               6
    Output: Metric M of the clusters found
                                                                Diabetes        768           8                2
 1 foreach spider C of population P do
                                                                  Iris          150           4                3
 2      Choose randomly k points from dataset D
        and create the array C (spider C) of
        cluster centers;                                    GA, SSO) were executed 50 times . Each execu-
 3 Calculate fitness of population P ;                      tion of an algorithm returns the metric M of the
 4 Calculate weight of population P ;                       best solution found. Then, the following measures
 5 for i = 2 to numberGenerations do                        were calculated: average, median, minimum and
 6     Cooperative operator for female spiders;             maximum value of the results.
 7     Cooperative operator for male spiders;                  The results of the experiments for each dataset
 8     Mating operator;                                     are shown in tables 2, 3, 4, 5, and 6 where the best
 9     Replacement of spiders in P ;                        results are highlighted in bold.
10     Calculate fitness of population P ;
11     Calculate weight of population P ;                   Table 2: Results of the experiments for the dataset
12   Return fitness (metric M) of the best solution         Balance
                                                                             K-means    Genetic. Alg.   SSO Alg.
     found;
                                                                    Mean     1426,544     1423,860      1423,851
                                                                    Median   1425,804     1423,851      1423,851
                                                                   Minimum   1423,851     1423,851      1423,851
                                                                   Maximum   1442,669     1424,071      1423,851
   The Cancer-int dataset is one of three domains
provided by the Oncology Institute that has repeat-
edly appeared in the machine learning literature.           Table 3: Results of the experiments for the dataset
This data set includes 201 instances of one class           Cancer-Int
and 85 instances of another class.                                           K-means    Genetic Alg.    SSO Alg.
                                                                    Mean     2824,135     2820,319      2820,302
   In the Dermatology dataset is shown diagnoses                    Median   2824,136     2820,302      2820,302
of erythemato-squamos diseases.                                    Minimum   2824,136     2820,302      2820,302
                                                                   Maximum   2824,136     2821,138      2820,302
   Diabetes patient records were obtained from
two sources: an automatic electronic recording de-
vice and paper records. The automatic device had            5.1   Discussion
an internal clock to timestamp events, whereas the
                                                            In the experiments for the dataset Balance, see Ta-
paper records only provided ”logical time” slots
                                                            ble 2, we can see that SSO algorithm has the best
(breakfast, lunch, dinner, bedtime).
                                                            results respect to all measures. Furthermore, re-
   Finally Iris contains 3 classes of 50 instances          spect to the median and minimum values the SSO
each, where each class refers to a type of iris plant.      algorithm has the same values as the GA.
One class is linearly separable from the other 2;              From the results for the dataset Cancer-int,
the latter are NOT linearly separable from each             shown in Table 3, we can see that SSO algorithm
other.                                                      has the best results respect to all measures. Fur-
   More features about the datasets are shown in            thermore, respect to the median and minimum val-
the Table 1.                                                ues the SSO algorithm has the same results as the
   For the experiments the number of generations            GA.
was fixed at 100 for the three algorithms (K-                  In the case of Dermatology dataset, shown in
means, GA, SSO). For the case of GA and SSO                 Table 4, GA algorithm has the best results respect
algorithms the number of elements of their respec-          to all measures. Furthermore, respect to the me-
tive population was fixed at 100.                           dian and minimum values the SSO algorithm has
   The experiments were performed as follows:               the same results as the GA this result is similar as
for each dataset, the three algorithms (K-means,            the two previous experiments.



                                                      119
Table 4: Results of the experiments for the dataset           Table 6: Results of the experiments for the dataset
Dermatology                                                   Iris
                 K-means     Genetic Alg.   SSO Alg.                             K-means   Genetic Alg.   SSO Alg.
        Mean     1127,390     1092,353      1092,355                  Mean       102.495     97.223        97.222
        Median   1121,087     1092,341      1092,356                  Median     97.326      97.222        97.222
       Minimum   1092,644     1092,341      1092,341                 Minimum     97.326      97.222        97.222
       Maximum   1415,274     1092,373      1092,373                 Maximum     124.182     97.232        97.222



Table 5: Results of the experiments for the dataset           and the null hypothesis is assumed, this results is
Diabetes                                                      shown using the symbol ’s-’.
                 K-means     Genetic Alg.   SSO Alg.
                                                                 In the table 7 are shown the results of the sta-
       Mean      52072,244    49160,016     49159,956
       Median    52072,243    49160,214     49159,939         tistical test of Wilcoxon between the SSO and K-
      Minimum    52072,244    49157,441     49157,441
      Maximum    52072,244    49161,999     49165,111
                                                              means clustering algorithms. In this table we can
                                                              see that there is statistical difference between SSO
                                                              and K-means algorithms for all cases. So we can
   In the Table 5, we can see results of Diabetes             conclude that the SSO algorithm presents results
dataset, the results shown that SSO algorithm has             significantly better than K-means algorithm.
the best results respect to all measures. And, re-
spect to the minimum values the SSO algorithm
                                                              Table 7: Results of the Wilcoxon rank sum test
has the same results as the GA.
                                                              between the SSO algorithm and the K-means al-
   Finally in the results for the Iris dataset, shown
                                                              gorithm
in Table 6, SSO algorithm has the best results re-                    Dataset       SSO Alg.      K-means
spect to all measures too. Furthermore, respect                                      (median)      (median)
to the median and minimum values the SSO al-                         Balance         1423,851     1425,804          s+
gorithm has the same results as the GA.                             Cancer-Int       2820,302     2824,136          s+
                                                                   Dermatology       1092,356     1121,086          s+
                                                                     Diabetes       49159,939     52072,244         s+
5.2 Statistical Analysis
                                                                       Iris           97,222        97,326          s+
An additional statistical analysis was performed
for comparing the algorithms, since we are work-                 In the table 8 is shown the result of the statis-
ing with stochastic algorithms.                               tical test of Wilcoxon between the SSO and the
   The following methodology was used: first the              GA. In this table we can see that for most cases
Kolmogorov-Smirnov test was applied to deter-                 the SSO and GA algorithms have similar results.
mine whether results (of 50 executions) of each               Only in the case of Iris dataset exists statistical
algorithm have a normal distribution. After deter-            significance, but we can not conclude that an al-
mining that the algorithms do not have normal dis-            gorithm is better than another since they have the
tribution the non parametric Wilcoxon rank sum                same median, we can only say that the algorithms
test was applied to compare the medians of two                have different behavior.
algorithms. This methodology was discussed and
applied in others works (Demšar, 2006), (Durillo             Table 8: Results of the Wilcoxon rank sum test
et al., 2009), (Muñoz et al., 2011).                         between the SSO algorithm and the genetic algo-
   The Wilcoxon rank sum test is used to test the             rithm
null hypothesis (H0 ) that the samples (of 50 ex-                     Dataset       SSO Alg.         GA
                                                                                     (median)      (median)
ecutions) of two algorithms come from distribu-
                                                                     Balance         1423,851     1423,851          s-
tions with same medians. If the null hypothesis                     Cancer-Int       2820,302     2820,302          s-
is rejected the alternative hypothesis is assumed                  Dermatology       1092,356     1092,341          s-
(HA ) that the samples come from distributions                       Diabetes       49159,939     49160,214         s-
                                                                       Iris           97,222        97,222          s+
with different medians.
   A significance level of 5% (p value less or
equal than 0.05) was used for the Wilcoxon rank
                                                              6   Conclusions and Future Work
sum test. If the test is successful then the null
hypothesis is rejected and the alternative hypothe-           In this work, a SSO approach for the cluster-
sis is assumed, this result is shown using the ’s+’           ing problem has been proposed. For evaluat-
symbol. Otherwise p value is greater than 0.05                ing its performance experiments were performed



                                                        120
over 5 datasets from the UCI repository (Balance,             S. Lloyd. 1982. Least squares quantization in
Cancer-Int, Dermatology, Diabetes, and Iris), also              pcm. Information Theory, IEEE Transactions on,
                                                                28(2):129–137, Mar.
comparisons were performed with two classic ap-
proaches for the clustering problem: the k-means              Ujjwal Maulik and Sanghamitra Bandyopadhyay.
algorithm and a genetic algorithm for clustering.               2000. Genetic algorithm-based clustering tech-
   The experiments showed that the SSO algo-                    nique. Pattern Recognition, 33(9):1455 – 1465.
rithm has better results regarding the algorithm k-           Boris Mirkin. 2012. Clustering: a data recovery ap-
means, and regarding the genetic algorithm, the                 proach. CRC Press.
SSO algorithm has equally competitive results.
                                                              Daniel M Muñoz, Carlos H Llanos, LDS Coelho, and
All these results were validated statistically using            Mauricio Ayala-Rincón. 2011. Opposition-based
the non-parametric Wilcoxon rank sum test. Thus,                shuffled pso with passive congregation applied to fm
the main contribution of this work was to show                  matching synthesis. In Evolutionary Computation
that the SSO algorithm can produce competitive                  (CEC), 2011 IEEE Congress on, pages 2775–2781.
                                                                IEEE.
results when compared with classic algorithms.
   As future works, we will include comparisons               Satyasai Jagannath Nanda and Ganapati Panda. 2014.
with newer algorithms of the literature. Also, it               A survey on nature inspired metaheuristic algo-
                                                                rithms for partitional clustering. Swarm and Evo-
is interesting to include in the experiments bigger             lutionary Computation, 16:1–18.
datasets. Finally, additional experiments will be
performed using other metrics such as the Clas-               Dan Pelleg and Andrew Moore. 1999. Accelerating
sification Error Percentage used in others works:               exact k-means algorithms with geometric reasoning.
                                                                In Proceedings of the fifth ACM SIGKDD interna-
(De Falco et al., 2007), (Karaboga and Ozturk,                  tional conference on Knowledge discovery and data
2011) and (Senthilnath et al., 2011).                           mining, pages 277–281. ACM.
                                                              Steven J Phillips. 2002. Acceleration of k-means and
References                                                       related clustering algorithms. In Algorithm Engi-
                                                                 neering and Experiments, pages 166–177. Springer.
Erik Cuevas, Miguel Cienfuegos, Daniel Zaldvar, and
   Marco Prez-Cisneros.     2013.     A swarm opti-           J Senthilnath, SN Omkar, and V Mani. 2011. Clus-
   mization algorithm inspired in the behavior of the            tering using firefly algorithm: Performance study.
   social-spider. Expert Systems with Applications,              Swarm and Evolutionary Computation, 1(3):164–
   40(16):6374 – 6384.                                           171.

Ivanoe De Falco, Antonio Della Cioppa, and Ernesto            Michael Steinbach, George Karypis, Vipin Kumar,
   Tarantino. 2007. Facing classification problems              et al. 2000. A comparison of document cluster-
   with particle swarm optimization. Applied Soft               ing techniques. In KDD workshop on text mining,
   Computing, 7(3):652–658.                                     volume 400, pages 525–526. Boston.

Janez Demšar. 2006. Statistical comparisons of classi-       Harley Vera-Olivera and José Luis Soncco-Álvarez.
   fiers over multiple data sets. The Journal of Machine        2016. Algoritmo de optimización basado en el com-
   Learning Research, 7:1–30.                                   portamiento social de arañas para clustering. Uni-
                                                                versidad Nacional de San Antonio Abad del Cusco.
Juan J Durillo, José Garcı́a-Nieto, Antonio J Nebro,           Undergraduate Thesis for Obtaining the Engineer
  Carlos A Coello Coello, Francisco Luna, and En-               Degree in Informatics and Systems.
  rique Alba. 2009. Multi-objective particle swarm
  optimizers: An experimental comparison. In Evo-             M vijayalakshmi and M Renuka Devi. 2012. A survey
  lutionary Multi-Criterion Optimization, pages 495–            of different issue of different clustering algorithms
  509. Springer.                                                used in large data sets. In International Journal of
                                                                Advanced Research in Computer Science and Soft-
John H Holland. 1975. Adaptation in natural and ar-             ware Engineering, pages 305 – 307.
  tificial systems: An introductory analysis with ap-
  plications to biology, control, and artificial intelli-     R. Xu and D.C Wunsch. 2009. Clustering. Oxford
  gence. U Michigan Press.                                      Wiley.

Anil K Jain, M Narasimha Murty, and Patrick J Flynn.
  1999. Data clustering: a review. ACM computing
  surveys (CSUR), 31(3):264–323.

Dervis Karaboga and Celal Ozturk. 2011. A novel
  clustering approach: Artificial bee colony (abc) al-
  gorithm. Applied Soft Computing, 11(1):652–657.




                                                        121