Restarting a Genetic Algorithm for Set Cover
         Problem Using Schnabel Census *

                                    Anton V. Eremeev 1,2
                    1 Dostoevsky Omsk State University, Omsk, Russia
    2 The Institute of Scientific Information for Social Sciences RAS, Moscow, Russia
                                  eremeev@ofim.oscsbras.ru


          Abstract. A new restart rule is proposed for genetic algorithms (GAs)
          with multiple restarts. This rule is based on the Schnabel census method,
          originally developed for statistical estimation of the animal population
          size. It is assumed that during a number of latest iterations, the popula-
          tion of a GA was in the stationary distribution and the Schnabel census
          method is applicable for estimating the quantity of different solutions
          that can be visited with a positive probability. The rule is to restart a
          GA as soon as the maximum likelihood estimate reaches the number of
          different solutions observed at the last iterations.
          We demonstrate how the new restart rule can be incorporated into a
          GA with non-binary representation for the Set Cover Problem. Compu-
          tational experiments on benchmarks from OR-Library show a significant
          advantage of the GA with the new restarting rule over the original GA.
          On the unicost instances, the new restarting rule also proved superior
          to restarting the GA as soon as the current iteration number
          becomes twice the iteration number when the best incumbent was found.

          Keywords: Multistart · Schnabel census · Maximum likelihood · Trans-
          fer of methods


1       Introduction
Genetic algorithms (GAs) are randomized search heuristics based on biological
analogy of selective breeding in nature, originating from the work of J. Holland
and applicable to a wide range of optimization problems. The basic components
of a GA are a population of individuals and random operators that are introduced
to model mutation and crossover in nature. An individual is a pair of genotype
g and phenotype x(g) corresponding to a search point in the space of solutions
D to a given optimization problem. Here g is a fixed length string of symbols
(called genes) from some alphabet A. The function x(g) maps g to its phenotype
*
    This research is supported by the Russian Science Foundation grant 17-18-01536.
     Copyright © by the paper’s authors. Copying permitted for private and academic purposes.
    In: S. Belim et al. (eds.): OPTA-SCL 2018, Omsk, Russia, published at http://ceur-ws.org

x(g) ∈ D, thus defining a representation of solutions in GA. The search in GAs is
guided by the values of the fitness function Φ(g) = φ(f (x(g))) on the genotypes
of the current population Π t on iteration t. Here f : D → R is the objective
function of a problem, and φ : R → R may be the identity mapping or a
monotone function, chosen appropriately to intensify the search. The genotypes
of the initial population are randomly generated according to some a priori
defined probability distribution.
    In this paper, a new restart rule is proposed for the GAs. The rule is based
on the Schnabel Census method, originally developed for statistical estimation
of size ν of animal populations [6] where one takes repeated samples of size 1 (at
suitable intervals of time) and counts the number of distinct animals seen. This
method was also adapted to estimate the number of local optima on the basis
of repeated local search [4]. Experiments [4] showed that the estimates based on
this approach were good for isotropic landscapes (i.e. those with uniform basin
sizes), but have a negative bias when basin sizes significantly differ.
    We show how the new restart rule can be incorporated into a GA with Non-
Binary Representation (NBGA) [3] for the Set Cover Problem (SCP). A detailed
description of this algorithm is provided in the appendix. Computational exper-
iments on benchmark instances from OR-Library show a significant advantage
of the GA with the new restarting rule in comparison to the original version of
the GA from [3]. In particular, given equal CPU time, in 35 out of 74 SCP in-
stances the new version of GA had greater frequency of finding optimal solutions
and only in 5 out of 74 instances the new version showed inferior results. The
new restarting rule also proved advantageous over the well-known rule
of restarting the GA as soon as the current iteration number becomes twice the
iteration number when the best incumbent was found.
    The Schnabel Census method as a means of estimation of the number of un-
visited solutions [4], as well as the genetic algorithm, both emerged as a transfer
of ideas from biology into computer science. Interestingly enough, in the present
paper the two methods are combined. However, the Schnabel Census, originally
developed for estimating the size of animal populations, is not used here for
counting individuals, but for estimating the number of solutions which may be
visited if the distribution of offspring in the GA remains unchanged.


2     Restart Rule Based on Schnabel Census

One of the methods developed in biometrics for statistical estimation of size
of animal populations is the Schnabel Census method [6]. According to this
method, one takes repeated samples of size n0 (at suitable intervals of time)
from the same population and counts the number of distinct animals seen. The
usual assumption is that the probability of catching any particular animal is the
same. The sampled individuals are marked (unless they were marked previously)
and returned back into the population. Then statistical estimates for the total
number ν of individuals in population are computed on the basis of the num-
ber of already marked individuals observed in the samples. In what follows, we

will apply the Schnabel Census method to estimate the number of values that a
discrete random variable may take with non-zero probability.
    Let r be a parameter that defines the length of the historical period that
is considered for statistical analysis. Given a value of r, the
new restart rule assumes that during the r latest iterations, the GA population
was at a stationary distribution and all tentative solutions produced on these
iterations may be treated analogously to sampled animals in the Schnabel Census
method. The Schnabel Census method is applied here to estimate the number ν
of different solutions that may be visited with positive probability if the current
distribution of offspring remains unchanged.
    In the sequel, we assume that in the latest r iterations of a GA we have a
sample of r independent offspring solutions and the random variable K is the
number of distinct solutions among them. In addition we make a simplifying
assumption that all solutions that may be generated in the stationary distribution
have equal probabilities. Let k denote the observed value of K and let ν̂ ML
denote the maximum likelihood estimate of ν computed from k and r. The rule
consists in restarting the GA as soon as the estimate ν̂ ML becomes equal to k.
The value of parameter r is chosen adaptively during the GA execution. The
rationale behind this rule is that once the equality ν̂ ML = k is satisfied, most
likely there are no more non-visited solutions in the area where the GA
population spent the latest r iterations. In such a case it is more appropriate
to restart the GA rather than to wait until the population distribution
changes significantly.
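
    Under the equal-probability assumption, the chance of seeing k distinct values in r independent draws from ν values is proportional to ν!/((ν − k)! ν^r) for ν ≥ k, which is a standard occupancy result. The Python sketch below maximizes this likelihood by a direct upward scan. The function names, the scan cap nu_max, the reliance on the likelihood having a single peak in ν, and the convention of returning None when k = r (where the likelihood keeps growing with ν) are illustrative assumptions rather than the implementation used in the experiments.

```python
import math
from typing import Optional


def log_likelihood_step(nu: int, k: int, r: int) -> float:
    """Return log L(nu + 1) - log L(nu), where L(nu) is proportional to
    nu! / ((nu - k)! * nu**r), the likelihood of observing k distinct
    values in r equiprobable draws from nu values."""
    return math.log(nu + 1) - math.log(nu + 1 - k) - r * math.log1p(1.0 / nu)


def schnabel_ml_estimate(k: int, r: int, nu_max: int = 10**6) -> Optional[int]:
    """Integer nu >= k maximizing the likelihood, or None if k == r.

    Assumption: the likelihood is unimodal in nu, so the scan stops at
    the first nu where moving to nu + 1 no longer increases it.
    """
    if not 1 <= k <= r:
        raise ValueError("need 1 <= k <= r")
    if k == r:
        # Every draw was new: the likelihood increases with nu without bound.
        return None
    nu = k
    while nu < nu_max and log_likelihood_step(nu, k, r) > 0:
        nu += 1
    return nu


def should_restart(k: int, r: int) -> bool:
    """Restart condition of the rule: the ML estimate equals k."""
    return schnabel_ml_estimate(k, r) == k
```

    With k = 3 distinct solutions in r = 4 draws the likelihood (ν − 1)(ν − 2)/ν^3 peaks at ν = 5, so no restart is triggered; with many repeated solutions, for example k = 82 and r = 400, the estimate collapses to k and should_restart returns True.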


3   The Genetic Algorithm for Set Covering Problem
In this section, we describe the GA application which is used for testing
the new restart rule. The Set Cover Problem (SCP) can be stated as follows.
Consider a set M = {1, . . . , m} and the subsets $M_j \subseteq M$, where j ∈ N =
{1, ..., n}. A subset J ⊆ N is a cover of M if $\bigcup_{j \in J} M_j = M$. For each $M_j$,
a positive cost $c_j$ is assigned. The SCP is to find a cover of minimum total
cost.
    The SCP may be formulated as an integer linear programming problem:

                         $\min\{cx : Ax \ge e,\ x \in \{0, 1\}^n\}$,                       (1)

where c is an n-vector of costs, e is the m-vector of 1s, A = (aij ) is an m × n
matrix of 0s and 1s, where aij = 1, iff i ∈ Mj . Using this formulation one
can regard SCP as a problem of optimal covering all rows of A by a subset of
columns.
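
    To make the two views of the problem concrete, the following Python sketch checks a candidate cover both in the set form and in the matrix form of (1). The four-row toy instance and all function names are hypothetical and serve only as an illustration.

```python
def is_cover(subsets, chosen, m):
    """Set form: the chosen subsets M_j together contain every row 1..m."""
    covered = set()
    for j in chosen:
        covered |= subsets[j]
    return covered >= set(range(1, m + 1))


def incidence_matrix(subsets, m):
    """Matrix A of formulation (1): a_ij = 1 iff row i belongs to M_j."""
    n = len(subsets)
    return [[1 if i in subsets[j] else 0 for j in range(1, n + 1)]
            for i in range(1, m + 1)]


def satisfies_constraints(A, x):
    """Matrix form: every component of Ax is at least 1."""
    return all(sum(a * xj for a, xj in zip(row, x)) >= 1 for row in A)


def cover_cost(costs, chosen):
    """Objective value cx of the cover given by the chosen columns."""
    return sum(costs[j] for j in chosen)


# Hypothetical toy instance with m = 4 rows and n = 4 columns.
subsets = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {1, 4}}
costs = {1: 3, 2: 2, 3: 3, 4: 2}
A = incidence_matrix(subsets, 4)
```

    On this instance both {1, 3} and {2, 4} are covers, with costs 6 and 4 respectively, while {1, 2} leaves row 4 uncovered.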
    In this paper, we use the same GA for the Set Covering Problem as proposed
in our earlier work [3]. The GA is based on the elitist steady-state population
management strategy. It is denoted as NBGA because it uses non-binary repre-
sentation of solutions. The NBGA uses an optimized problem-specific crossover
operator, a proportional selection and a mutation operator that makes random
changes in every gene with a given probability pm . Each new genotype under-
goes greedy improvement procedures before it is added into the population. A
detailed description of this algorithm is provided in the appendix.

4     Results of Computational Experiments
The NBGA was tested on OR-Library benchmark problem sets 4-6, A-H, and
two sets of combinatorial problems CLR and Stein. The sets 4-6 and A-H consist
of problems with randomly generated costs cj from 1,...,100, while CLR and Stein
consist of unicost problems, i.e. here cj = 1 for all j. We compared three modes
of GA execution.
 – Mode (i): single run without restarts.
 – Mode (ii): restarting the GA as soon as the current iteration number becomes
   twice the iteration number when the best incumbent was found. This rule
   was used successfully by different authors for restarting random hill-climbing
   and genetic algorithms.
 – Mode (iii): restarting the GA using the new rule proposed in Section 2. The
   historic period used for statistical analysis was chosen as r latest iterations
   of the GA, where value r is chosen adaptively as follows: Whenever the best
   found solution is improved, r is set to be the population size. If the best
   incumbent was not improved during the latest 2r iterations, we double the
   value of r. Here we reset r to the population size, assuming that whenever
   the best incumbent is improved, the population reaches a new unexplored
   area and we should reduce the length of the historic period for analysis.
   If the best incumbent is not improved long enough, our rule extends the
   historic period. In order to reduce the CPU cost, the termination condition
   is checked only at those iterations when the value of r is updated. Fig. 1
   illustrates a typical behavior of parameter r, together with the number of
   different offspring solutions k and the maximum likelihood estimate ν̂ ML
   over a single GA run, until the termination condition was satisfied.
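
    The adaptive choice of r in mode (iii) can be sketched in Python as follows. This is one possible reading of the rule and not the code used in the experiments: it assumes that the 2r iterations without improvement are counted from the most recent update of r (reset or doubling). The class and function names are hypothetical.

```python
class AdaptiveHistory:
    """Tracks the length r of the historic period used by the restart rule."""

    def __init__(self, pop_size):
        self.pop_size = pop_size
        self.r = pop_size
        self.last_update = 0  # iteration at which r was last reset or doubled

    def step(self, t, improved):
        """Process iteration t; return True if r was updated, i.e. if the
        restart condition should be checked at this iteration."""
        if improved:
            # New best incumbent: shrink the historic period.
            self.r = self.pop_size
            self.last_update = t
            return True
        if t - self.last_update >= 2 * self.r:
            # No improvement for 2r iterations: extend the historic period.
            self.r *= 2
            self.last_update = t
            return True
        return False


def trace_r(pop_size, improvement_iterations, t_max):
    """List of (iteration, new r) pairs for a run with known improvements."""
    history = AdaptiveHistory(pop_size)
    updates = []
    for t in range(1, t_max + 1):
        if history.step(t, t in improvement_iterations):
            updates.append((t, history.r))
    return updates
```

    For example, with population size 100 and improvements at iterations 359, 430, 639 and 683, trace_r reports that r is doubled to 200 at iteration 883 and to 400 at iteration 1283 under this reading.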
    In our experiments, N = 30 trials of GA in each of the three modes were
carried out. The population size was 100 and the total budget of GA iterations
over all runs was equal to 10 000. A single experiment with the given budget we
call a trial. Let $\sigma := \sum_{k=1}^{30} \frac{f_k - f^*}{30 f^*} \cdot 100\%$, where $f_k$ is the cost of solution found
in k-th trial and $f^*$ is the optimal cost. In what follows, Fbst will denote the
frequency of obtaining a solution with the best known cost from the literature,
estimated by 30 trials. The statistically significant difference at level p ≤ 0.05
between the frequencies of finding optimal solutions is reported below.
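
    As an illustration of the two performance measures, the Python sketch below computes the average percentage deviation σ and the frequency Fbst from a list of per-trial costs. The function names and the sample data are hypothetical; the statistical test used for comparing frequencies is not shown.

```python
def average_deviation_percent(costs_found, optimal_cost):
    """sigma: mean relative excess of the found costs over the optimum, in %."""
    n = len(costs_found)
    return sum((f - optimal_cost) / (n * optimal_cost) for f in costs_found) * 100.0


def best_known_frequency(costs_found, best_known_cost):
    """F_bst: fraction of trials that reached the best known cost."""
    hits = sum(1 for f in costs_found if f == best_known_cost)
    return hits / len(costs_found)


# Hypothetical outcome of 30 trials: 27 reach cost 100, 3 end with cost 102.
trial_costs = [100] * 27 + [102] * 3
sigma = average_deviation_percent(trial_costs, 100)
f_bst = best_known_frequency(trial_costs, 100)
```

    For this sample, σ is 0.2% and Fbst is 0.9.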
    For all non-unicost problems, we set the mutation probability pm = 0.1. Com-
paring the GA results in modes (i) and (iii), among 37 instances, where these
two modes yield different frequencies Fbst , mode (iii) has a higher value Fbst in
31 cases and in 16 out of these 31 cases the difference is statistically significant.
Mode (i) has a statistically significant advantage over mode (iii) only on a single
instance. Numbers of instances where mode (i) or mode (iii) had a higher fre-
quency of finding optima are shown in Fig. 2 (denoted “>freq”). This figure also
shows the numbers of instances where these modes had statistically significant
advantage (denoted “p < 0.05”).
    Modes (ii) and (iii) show different frequencies Fbst on 28 instances. On 16
of these instances mode (iii) has a higher value Fbst than mode (ii) and in 5


Fig. 1. Dynamics of parameter r, the number of different offspring solutions k and the
maximum likelihood estimate ν̂ ML during 1300 iterations on instance CLR.13, until the
termination condition was satisfied (k = ν̂ ML = 82). Here r is reset to the population
size 100 when the best incumbent is improved in iterations 359, 430, 639 and 683.


out of these 16 cases the difference is statistically significant. Mode (ii) has a
statistically significant advantage over mode (iii) only on a single instance. In
terms of percentage of deviation σ, averaged over all instances of series 4-6,
A,C,D and E,F,G,H, mode (iii) gives the least error.


             Fig. 2. Comparison of the GA results in modes (i) and (iii)


   The three modes of running the GA were also tested on two series of uni-
cost SCP instances CLR and Stein. Again we use the population size of 100
individuals and the total budget of GA iterations, equal to 10 000 in each trial.


            Fig. 3. Comparison of the GA results in modes (ii) and (iii)


The mutation probability pm is set to 0.01 for all instances. On the unicost in-
stances, restart mode (iii) shows better or equal results compared to the other
two modes, except for a single instance CLR.13. On CLR.13, the best known
solution is found only in mode (i), and it took more than 4000 iterations. Mode (iii)
was ineffective on this instance, which is probably due to a negative bias of the
maximum likelihood estimate ν̂ ML (see [4]).


5     Conclusions
For genetic algorithms, a new restart rule is proposed using the Schnabel Census
Method, originally developed to estimate the number of animals in a population.
Performance of the new restart rule is shown on a GA with a non-binary rep-
resentation of solutions for the set cover problem. Computational experiments
show a significant advantage of the GA with the new restart rule over the GA
without restarting and the GA restarting as soon as the current iteration number
becomes twice the iteration number when the best incumbent was found. The new restart rule also
demonstrated the most stable behavior compared to the other two GA modes.
    Applying Schnabel Census in this research is an attempt to further benefit
from the convergence between computer science and biology. That interface had
been crucial for evolutionary computation to emerge, but was scarcely main-
tained systematically afterwards. As the present research shows, developing this
transdisciplinary integration can be productive.


Appendix
This appendix contains a description of the NBGA [3] for SCP. We will assume
that the columns are ordered according to nondecreasing values of $c_j$, and if
two columns have equal costs then the one which covers more rows stands first.
   Denote the set of columns that cover the row i by Ni = {j ∈ N : aij =
1}. The NBGA is based on a non-binary representation where the genotype

consists of m genes $g^{(1)}, g^{(2)}, \ldots, g^{(m)}$, such that $g^{(i)} \in N_i$, $i = 1, 2, \ldots, m$.
In this representation x(·) maps a genotype g to the phenotype x(g), where
ones correspond to the columns present in genes of g. Obviously, it is a feasible
solution. For local improvements we use greedy heuristics that find approximate
solutions to a corresponding reduced version Pg of the given SCP. Problem Pg
has a matrix that consists of the columns of A represented in genes of g.
    An improved genotype is added to the population only if there are no
individuals with the same phenotype in it yet. Genotypes of $\Pi^0$ are generated
independently and each gene $g^{(i)}$ is uniformly distributed over $N_i$.
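
    The non-binary representation can be illustrated with a short Python sketch. It builds the sets N_i, draws a random genotype with each gene uniform over N_i, and maps a genotype to its 0/1 phenotype. The toy instance and the function names are hypothetical.

```python
import random


def covering_columns(subsets, m):
    """N_i for each row i: sorted indices of the columns that cover row i."""
    return {i: sorted(j for j, rows in subsets.items() if i in rows)
            for i in range(1, m + 1)}


def random_genotype(N, rng):
    """Genotype with gene i drawn uniformly from N_i (rows taken in order)."""
    return [rng.choice(N[i]) for i in sorted(N)]


def phenotype(genotype, n):
    """0/1 vector x(g): ones at the columns that occur in genes of g."""
    x = [0] * n
    for j in genotype:
        x[j - 1] = 1
    return x


# Hypothetical toy instance with m = 4 rows and n = 4 columns.
subsets = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {1, 4}}
N = covering_columns(subsets, 4)
g = random_genotype(N, random.Random(0))
x = phenotype(g, 4)
```

    Since gene i always holds a column that covers row i, every phenotype obtained this way is a feasible cover; for instance the genotype [1, 1, 3, 3] maps to x = [1, 0, 1, 0].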
    Consider the population on iteration t as a vector of genotypes of its
individuals $\Pi^t = (g_1, g_2, \ldots, g_s)$. Then the fitness function on iteration t is
$\Phi_t(g) = c\,x(g_{l(t)}) - c\,x(g)$, where l(t) is the index of the individual of the largest cover
cost in $\Pi^t$.
    The selection operator in our GA implements the proportional selection
scheme, where the probability to choose the k-th individual is
\[ p(g_k) = \Phi_t(g_k) \left( \sum_{l=1}^{s} \Phi_t(g_l) \right)^{-1}. \]
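
    A minimal Python sketch of this fitness function and of proportional selection is given below. The function names are hypothetical, and the uniform fallback used when all individuals have the same cost (so that all fitness values are zero) is an assumption added to keep the sketch well defined.

```python
import random


def fitness_values(cover_costs):
    """Fitness of each individual: worst cover cost in the population
    minus the individual's own cover cost."""
    worst = max(cover_costs)
    return [worst - c for c in cover_costs]


def selection_probabilities(cover_costs):
    """Proportional selection probabilities p(g_k)."""
    phi = fitness_values(cover_costs)
    total = sum(phi)
    if total == 0:
        # Assumption: fall back to uniform selection if all costs coincide.
        return [1.0 / len(phi)] * len(phi)
    return [value / total for value in phi]


def select_parent(population, cover_costs, rng):
    """Draw one individual according to the proportional selection scheme."""
    weights = selection_probabilities(cover_costs)
    return rng.choices(population, weights=weights, k=1)[0]
```

    With cover costs 10, 12 and 14 the fitness values are 4, 2 and 0, so the selection probabilities are 2/3, 1/3 and 0; the individual with the largest cover cost is never chosen as a parent.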

   General outline of NBGA
1. While the initial population is not complete do
    1.1. Generate a random genotype g.
    1.2. Apply the column elimination procedure Prime to g and add g to the
         population.
2. For t:=1 to tmax do
    2.1. Choose the parent genotypes gu , gv with the help of the proportional
         selection.
    2.2. Produce an offspring g from gu and gv using a crossover operator.
    2.3. Mutate each gene of g with probability pm .
    2.4. Obtain genotype $g'$, applying column elimination procedures Greedy and
         Dual Greedy to g.
    2.5. If there are no individuals in $\Pi^t$ with the phenotype $x(g')$, then
        2.5.1. substitute the least fit genotype in $\Pi^t$ by $g'$,
        2.5.2. otherwise substitute the least fit genotype in $\Pi^t$ by g.

    The operators of crossover, mutation and the local improvement procedures
Prime, Greedy and Dual Greedy will be described in what follows.
    We use an optimized crossover operator, the LP-crossover, designed for SCP.
The goal of this operator is to find the best possible combination of the genes
of parent genotypes gu and gv , if it is possible without extensive computations.
Consider a problem of optimal crossover Poc , which is a reduced version of the
initial SCP but with the covering subsets restricted by the set of indices
$N' = \{j : x(g_u)_j = 1\} \cup \{j : x(g_v)_j = 1\}$.
    First, in LP-crossover a trivial reduction is applied to Poc . Denote
                             $S = \{i \in M : |N_i \cap N'| = 1\}$.

Then each row i ∈ S may be covered by a single column j(i) in Poc . We call
Q = {j : j = j(i), i ∈ S} a set of fixed columns. For each row $i \in \bigcup_{j \in Q} M_j$, the gene
$g^{(i)}$ is assigned one of the columns in Q that cover row i. We refer to these genes
as fixed too. As a result of the reduction we obtain a new subproblem Pr , that
consists of the rows and columns that were not fixed during this procedure.
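
    The trivial reduction step can be sketched in Python as follows. The sketch only identifies the fixed columns and the rows and columns left for the reduced subproblem; the subsequent LP relaxation is not shown. Function names and the toy instance are hypothetical.

```python
def trivial_reduction(subsets, parent_u_cols, parent_v_cols, m):
    """Return (fixed columns, remaining rows, remaining columns).

    A row that is covered by exactly one column among the parents'
    columns forces that column into the offspring. Rows covered by
    forced columns and the forced columns themselves are removed
    from the reduced subproblem.
    """
    union_cols = set(parent_u_cols) | set(parent_v_cols)
    fixed_cols = set()
    for i in range(1, m + 1):
        candidates = [j for j in union_cols if i in subsets[j]]
        if len(candidates) == 1:
            fixed_cols.add(candidates[0])
    fixed_rows = set().union(*(subsets[j] for j in fixed_cols))
    free_rows = set(range(1, m + 1)) - fixed_rows
    free_cols = union_cols - fixed_cols
    return fixed_cols, free_rows, free_cols


# Hypothetical toy instance: row 5 is covered by column 5 only.
subsets = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {1, 4}, 5: {5}}
fixed, rows_left, cols_left = trivial_reduction(subsets, {1, 3, 5}, {2, 4, 5}, 5)
```

    Here column 5 is fixed, and the reduced subproblem keeps rows 1 to 4 and columns 1 to 4.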
    Second, we solve the linear relaxation of Pr , i.e. an LP problem where the
Boolean constraints are replaced with the conditions xj ≥ 0 for all j. If the
obtained solution x0 turns out to be integer, then it is used to complete the
genotype g by an assignment of the non-fixed genes that corresponds to x0 . To
avoid time-consuming computations we do not solve Pr if the number of rows
in Pr exceeds a threshold µ (we use µ = 150). If this is the case or if the number
of simplex iterations exceeds its limit (equal to 300 in our experiments), or if the
solution x0 is fractional, then LP-crossover returns an unchanged genotype gu .
    The mutation operator works as follows: Suppose the i-th gene is to be mutated;
then the probability to assign a column j ∈ Ni to $g^{(i)}$ is
\[ p_i(j) = \frac{1}{c_j} \left( \sum_{k \in N_i} \frac{1}{c_k} \right)^{-1}. \]
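
    In other words, a mutated gene receives a column with probability inversely proportional to its cost. A minimal Python sketch of this operator is shown below; the function names and the data layout (a dictionary N of column lists per row, a dictionary of costs) are hypothetical.

```python
import random


def mutation_distribution(N_i, costs):
    """Probabilities p_i(j) over the columns j in N_i, proportional to 1/c_j."""
    inverse_costs = [1.0 / costs[j] for j in N_i]
    total = sum(inverse_costs)
    return [w / total for w in inverse_costs]


def mutate(genotype, N, costs, pm, rng):
    """Mutate each gene independently with probability pm."""
    child = list(genotype)
    for idx in range(len(child)):
        if rng.random() < pm:
            row = idx + 1  # gene idx corresponds to row idx + 1
            weights = mutation_distribution(N[row], costs)
            child[idx] = rng.choices(N[row], weights=weights, k=1)[0]
    return child
```

    For a row covered by two columns with costs 1 and 3, the cheaper column is chosen with probability 0.75 and the more expensive one with probability 0.25.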


    We use three greedy-type heuristics to exclude redundant columns from a
solution. The simplest heuristic, Prime, starts with a given cover and discards
the columns in decreasing order of indices. A column is discarded only if the
remaining solution is still a cover. The second heuristic is the well-known Greedy
algorithm [2]. This algorithm may find a solution which is not minimal, therefore
Prime is run after it to eliminate redundant columns. The third heuristic we call
the Dual Greedy. It combines the successive column discarding of Prime and
the adaptive column pricing similar to Greedy. Denote the set of columns in the
subproblem Pg by $N' := \{j \in N : x(g)_j = 1\}$. A cover J is obtained as follows.

    The Dual Greedy Algorithm
1. Set $N'_i := N_i \cap N'$ for all i = 1, ..., m, $M' := M$, $J' := N'$, $J := \emptyset$.
2. While $J' \ne \emptyset$ do
       2.1. If there is $i \in M'$ such that $|N'_i| = 1$, then
                     using the $j \in N'_i$, set $J := J \cup \{j\}$, $M' := M' \setminus (M_j \cap M')$.
            Otherwise
                     choose $j \in J'$ such that $\frac{c_j}{|M_j \cap M'|} \ge \frac{c_k}{|M_k \cap M'|}$ for all $k \in J'$.
       2.2. Set $J' := J' \setminus \{j\}$, $N'_i := N'_i \setminus \{j\}$ for all $i \in M_j$.
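
    The steps of the Dual Greedy Algorithm can be sketched in Python as follows. Variable names are hypothetical. Two details are assumptions of this sketch: a column that covers no uncovered row is given an infinite price and is therefore discarded first, and ties are broken in favour of the smallest row or column index.

```python
import math


def dual_greedy(subsets, costs, candidate_cols, rows):
    """Extract a cover from candidate_cols by discarding badly priced columns."""
    uncovered = set(rows)            # rows not yet covered by the cover
    pool = set(candidate_cols)       # columns not yet kept or discarded
    # For each row, the pool columns that can still cover it.
    options = {i: {j for j in pool if i in subsets[j]} for i in rows}
    cover = set()
    while pool:
        forced_row = next(
            (i for i in sorted(uncovered) if len(options[i]) == 1), None)
        if forced_row is not None:
            # Only one column can still cover this row: it must be kept.
            (j,) = options[forced_row]
            cover.add(j)
            uncovered -= subsets[j]
        else:
            # Discard the column with the highest cost per newly covered row.
            def price(k):
                gain = len(subsets[k] & uncovered)
                return costs[k] / gain if gain else math.inf
            j = max(sorted(pool), key=price)
        pool.discard(j)
        for i in subsets[j]:
            if i in options:
                options[i].discard(j)
    return cover


# Hypothetical toy instance.
subsets = {1: {1, 2}, 2: {2, 3}, 3: {3}, 4: {1, 2, 3}}
costs = {1: 1, 2: 2, 3: 1, 4: 5}
result = dual_greedy(subsets, costs, {1, 2, 3, 4}, {1, 2, 3})
```

    On this instance the expensive column 4 is discarded first, column 1 then becomes the only option for row 1, column 2 is discarded next, and column 3 is kept for row 3, giving the cover {1, 3}.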

   In Step 1.2 of NBGA we use the Prime algorithm, while Greedy and Dual
Greedy are used in Step 2.4 of NBGA (both heuristics are called and the best of
the two obtained solutions is accepted to construct a new genotype $g'$).
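
    For comparison, the Prime procedure and the Greedy heuristic followed by Prime can be sketched in Python as below. The greedy step uses the usual cost-per-newly-covered-row pricing of the greedy heuristic for set covering; function names, tie-breaking by smallest index and the toy instance are assumptions of this sketch.

```python
def prime(subsets, cover, rows):
    """Discard redundant columns in decreasing order of indices."""
    kept = set(cover)
    for j in sorted(cover, reverse=True):
        rest = kept - {j}
        if set().union(*(subsets[k] for k in rest)) >= set(rows):
            kept = rest  # the remaining columns still form a cover
    return kept


def greedy(subsets, costs, candidate_cols, rows):
    """Greedy cover from candidate_cols, cleaned up by Prime."""
    uncovered = set(rows)
    chosen = set()
    while uncovered:
        useful = [j for j in sorted(candidate_cols) if subsets[j] & uncovered]
        if not useful:
            raise ValueError("candidate columns do not cover all rows")
        # Pick the column with the lowest cost per newly covered row.
        j = min(useful, key=lambda k: costs[k] / len(subsets[k] & uncovered))
        chosen.add(j)
        uncovered -= subsets[j]
    return prime(subsets, chosen, rows)


# Hypothetical toy instance.
subsets = {1: {1, 2}, 2: {2, 3}, 3: {3}, 4: {1, 2, 3}}
costs = {1: 1, 2: 2, 3: 1, 4: 5}
pruned = prime(subsets, {1, 2, 3, 4}, {1, 2, 3})
greedy_cover = greedy(subsets, costs, {1, 2, 3, 4}, {1, 2, 3})
```

    Starting from all four columns, Prime drops columns 4 and 3 and returns {1, 2}, whereas Greedy selects columns 1 and 3.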
   Before solving the non-unicost problems (where cj have different values) we
apply a core-refining reduction which is similar to the approach used in [1]. The
reduction keeps only the columns belonging to the set $N_{core} = \bigcup_{i=1}^{m} \alpha_i^{(10)}$, where
$\alpha_i^{(10)}$ is the set of the least 10 indices in Ni , i = 1, . . . , m.
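
    Because the columns are assumed to be ordered by nondecreasing cost, the least indices in each N_i correspond to the cheapest columns covering row i. A minimal Python sketch of the reduction is given below; the function name, the per_row parameter and the toy instance are hypothetical.

```python
def core_columns(subsets, m, per_row=10):
    """Union over all rows of the per_row least column indices covering the row."""
    core = set()
    for i in range(1, m + 1):
        N_i = sorted(j for j, rows in subsets.items() if i in rows)
        core.update(N_i[:per_row])
    return core


# Hypothetical toy instance with m = 4 rows and n = 4 columns.
subsets = {1: {1, 2}, 2: {2, 3}, 3: {3, 4}, 4: {1, 4}}
small_core = core_columns(subsets, 4, per_row=1)
full_core = core_columns(subsets, 4)
```

    With per_row=1 the toy instance keeps columns 1, 2 and 3; with the default of 10 every column is kept, since no row is covered by more than two columns.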


References
     1. Beasley, J.E., Chu, P.C.: A genetic algorithm for the set covering problem. Euro-
        pean Journal of Operational Research 94(2), 394–404 (1996)
     2. Chvátal, V.: A greedy heuristic for the set covering problem. Mathematics of
        Operations Research 4(3), 233–235 (1979)
     3. Eremeev, A.V.: A genetic algorithm with a non-binary representation for the set
        covering problem. In: Proc. of OR’98. pp. 175–181. Springer-Verlag (1999)
     4. Eremeev, A.V., Reeves, C.R.: Non-parametric estimation of properties of com-
        binatorial landscapes. In: Cagnoni, S., Gottlieb, J., Hart, E., Middendorf, M.
        and Raidl, G. (eds.): Applications of Evolutionary Computing: Proceedings of
        EvoWorkshops 2002. LNCS. vol. 2279, pp. 31–40. Springer-Verlag, Berlin (2002)
     5. Paixao, T., Badkobeh, G., Barton, N. et al.: Toward a unifying framework for
        evolutionary processes. Journal of Theoretical Biology 383, 28–43 (2015)
     6. Seber, G.A.F.: The Estimation of Animal Abundance. Charles Griffin, London
        (1982)