<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evolutionary Computation on the Connex Architecture</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>István Lőrentz</string-name>
          <email>istvan@splash.ro</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mihaela Malița</string-name>
          <email>mmalita@anselm.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Răzvan Andonie</string-name>
          <email>andonie@cwu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Central Washington University</institution>
          ,
          <addr-line>Ellensburg, WA, USA, and</addr-line>
          ,
          <institution>Electronics and Computers Department, Transylvania University</institution>
          ,
          <addr-line>Brașov</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science Department, Saint Anselm College Manchester</institution>
          ,
          <addr-line>Manchester, NH</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Electronics and Computers Department, Transylvania University</institution>
          ,
          <addr-line>Brașov</addr-line>
          ,
          <country country="RO">Romania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We discuss massively parallel implementation issues of the following heuristic optimization methods: Evolution Strategy, Genetic Algorithms, Harmony Search, and Simulated Annealing. For the first time, we implement these algorithms on the Connex architecture, a recently designed array of 1024 processing elements. We use the Vector-C programming environment, an extension of the C language adapted for Connex.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Evolutionary Algorithms (EA) are a collection of
optimization methods inspired from natural evolution
        <xref ref-type="bibr" rid="ref6">(Bäck 1996)</xref>
        ,
        <xref ref-type="bibr" rid="ref2">(Back, Fogel, &amp; Michalewicz 1997)</xref>
        ,
        <xref ref-type="bibr" rid="ref4">(Back, Fogel,
&amp; Michalewicz 1999)</xref>
        . The problem is formulated as
finding the minimum value of an evaluation function over a set
of parameters defined on a search space. Well known
evolutionary techniques are: Evolution Strategy (ES), Genetic
Algorithms (GA), and Evolutionary Programming (EP). These
techniques are also related to stochastic search (e.g.,
Simulated Annealing (SA)), and they share the following
characteristics:
• Start with a random initial population.
• At each step, a set of new candidate solutions is generated
based on the current population.
• Based on some criteria, the best candidates are selected to
form a new generation.
• The algorithm is repeated until the solution is found or a
maximum number of iterations is reached.
      </p>
      <p>EAs are meta-heuristic: they make few
assumptions about the function being optimized (for example, they
do not require known derivatives). From a meta-heuristic
point of view, the function to be optimized is a
'black box', controlled only by the input parameters and the output
value. At the same time, EAs are parallel by nature.
Parallel implementation of optimization algorithms is generally
a complex problem, and it becomes more challenging on
fine-grained architectures with inter-processor
communication burdens.</p>
      <p>
        Our study focuses on implementation issues of EAs on a
recent massively parallel architecture - the Connex
Architecture (CA). The CA is a parallel programmable VLSI chip
consisting of an array of processors. Functionally, it is an
array/vector processor. It is not a dedicated, custom-designed
(ASIC) chip, but a general-purpose architecture. The CA is
now developed by Vebris (http://www.vebris.com/). An older version was developed
in silicon by BrightScale, a Silicon Valley start-up company
(see
        <xref ref-type="bibr" rid="ref17 ref8">(Ștefan 2009)</xref>
        ).
      </p>
      <p>
        Several computationally intensive applications have
already been developed on the CA: data compression (Thiebaut
&amp; Ștefan), DNA sequence alignment
        <xref ref-type="bibr" rid="ref24">(Thiebaut &amp; Ștefan 2001)</xref>
        , DNA search
        <xref ref-type="bibr" rid="ref20 ref25 ref26 ref27">(Thiebaut, Ștefan, &amp; Malița 2006)</xref>
        ,
computation of polynomials (Thiebaut &amp; Malița), frame rate
conversion for HDTV
        <xref ref-type="bibr" rid="ref20 ref27">(Ștefan 2006)</xref>
        , real-time packet
filtering for detection of illegal activities
        <xref ref-type="bibr" rid="ref25 ref26 ref27">(Thiebaut &amp; Malița 2006)</xref>
        , neural computation
        <xref ref-type="bibr" rid="ref1 ref18">(Andonie &amp; Malița 2007)</xref>
        , and
Fast Fourier Transform
        <xref ref-type="bibr" rid="ref16">(Lőrentz, Malița, &amp; Andonie 2010)</xref>
        .
      </p>
      <p>
        We do not intend to compare the efficiency of different
EAs on the CA, but to provide the implementation
building blocks. The motivation and novelty of this work are
to expose the CA’s vector processing capability for
metaheuristic optimization algorithms. We provide the
resulting performance figures (instruction counts) for
several optimization benchmarks. The code is written in C++,
using Vector-C, available at
        <xref ref-type="bibr" rid="ref1 ref18">(Malița 2007)</xref>
        , a library of
functions which simulates CA operations. We use simulation
because the floating-point version of the chip is still under
development.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Review of Evolutionary Algorithms</title>
      <p>We first summarize the following standard
optimization algorithms: Evolution Strategy, Genetic Algorithms,
Harmony Search, and Simulated Annealing. We
describe them in a unified way, in accordance with the
general EA scheme from the Introduction.</p>
    </sec>
    <sec id="sec-3">
      <title>Genetic Algorithms</title>
      <p>
        In the original introduction of the 'Genetic Algorithm'
concept, described by
        <xref ref-type="bibr" rid="ref13">(Holland 1975)</xref>
        , the population of
'chromosomes' is encoded as binary strings. Inspired by
biological evolution, every offspring is produced by selecting
two parents (based on their fitness); the genetic operators
are crossover and single-bit mutation. The theoretical
foundation of GA is the Schema Theorem. Since the
original formulation, GA has evolved into many variants. We
consider here only the standard procedure:
      </p>
      <sec id="sec-3-1">
        <title>Algorithm 1 Genetic Algorithm</title>
        <p>Initialize the population as M vectors of length N over
the {0, 1} alphabet.
repeat</p>
        <p>Create M child vectors by:
1. Selecting 2 parents, proportionate to their fitness
2. Crossing over the parents at a random position
3. Mutating (flipping) bits at random
The M child vectors form the new
population; the old population is discarded.
until termination criterion fulfilled (solution found or
maximum number of iterations reached).</p>
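        <p>As a concrete illustration, here is a scalar C++ sketch of the two genetic operators from steps 2 and 3 above, using plain std::vector bitstrings rather than the Vector-C API (function names are illustrative):
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One-point crossover: copy from parent x up to pos, then from parent y.
std::vector<std::uint8_t> crossover1pt(const std::vector<std::uint8_t>& x,
                                       const std::vector<std::uint8_t>& y,
                                       std::size_t pos) {
    std::vector<std::uint8_t> c(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        c[i] = (i < pos) ? x[i] : y[i];
    return c;
}

// Single-bit mutation: flip the bit at position k.
std::vector<std::uint8_t> mutateBit(std::vector<std::uint8_t> v,
                                    std::size_t k) {
    v[k] ^= 1;
    return v;
}
```
        </p>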
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evolution Strategy</title>
      <p>
        Evolution Strategy is also a population based optimization
method, with canonical form written as (μ/ρ + λ)-ES. Here
μ denotes the number of parents, ρ the mixing number
(number of parents selected for reproduction of an offspring), λ
the number of offspring created in each iteration
        <xref ref-type="bibr" rid="ref7">(Beyer &amp;
Schwefel 2002)</xref>
        .
      </p>
      <p>Algorithm 2 Evolution Strategy (μ, λ)-ES
Initialize the population Vμ = {v1, . . . , vμ}. Each
individual v of the parent population is a vector of
N numbers encoding the decision variables (the search
space) of the problem. The population is initialized
randomly.
repeat</p>
      <p>Generate λ offspring v˜, forming the offspring
population {v˜1, . . . , v˜λ}, where each offspring v˜ is generated
by:
1. Select (randomly) ρ parents from Vμ.
2. Recombine the selected parents to form a
recombinant individual v˜.
3. Mutate the parameter set of the recombinant.
Select the new parent population (using deterministic
truncation selection) from either
- the offspring population V˜λ (referred to as
comma-selection, usually denoted (μ, λ)-selection),
or
- the offspring V˜λ and parent Vμ populations (referred to as
plus-selection, usually denoted (μ + λ)-selection)
until termination criterion fulfilled (solution found or
maximum number of iterations reached).</p>
      <p>The specific mutation and recombination operations will
be presented later in this paper.</p>
    </sec>
    <sec id="sec-5">
      <title>Harmony Search</title>
      <p>
        Harmony Search (HS) is a meta-heuristic algorithm inspired
by musical composition
        <xref ref-type="bibr" rid="ref10">(Geem, Kim, &amp; Loganathan 2001)</xref>
        .
According to
        <xref ref-type="bibr" rid="ref28">(Weyland 2010)</xref>
        , HS is a particular case of the
(μ + 1) ES algorithm. In HS, the population, encoded as
vectors of real or integer numbers, is stored in a matrix. The
population size (number of rows) is fixed. Each new
candidate is created by a discrete recombination (identical to the
recombination of ES), or as a random individual. Mutation
is performed with a given probability. A key parameter is
the mutation 'strength' (or bandwidth). The new individual
replaces the worst individual in the current population if
it is 'better'.
      </p>
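      <p>The HS iteration described above can be sketched as follows; the parameter names (hmcr for the memory considering rate, pm for the mutation probability, bw for the bandwidth) and the sum-of-squares objective are assumptions made for this sketch, not part of the CA implementation:
```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Sum-of-squares objective, assumed for this sketch.
double fitness(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x * x;
    return s;
}

// One Harmony Search step over a fixed-size memory matrix: recombine
// or draw a random value per position, mutate with probability pm,
// then replace the worst row only if the candidate is better.
void hsStep(std::vector<std::vector<double>>& mem, double hmcr,
            double pm, double bw, std::mt19937& gen) {
    std::uniform_real_distribution<double> U(0.0, 1.0);
    std::uniform_int_distribution<std::size_t> row(0, mem.size() - 1);
    std::vector<double> cand(mem[0].size());
    for (std::size_t i = 0; i < cand.size(); ++i) {
        cand[i] = (U(gen) < hmcr) ? mem[row(gen)][i]           // recombination
                                  : 2.0 * U(gen) - 1.0;        // random individual
        if (U(gen) < pm) cand[i] += bw * (2.0 * U(gen) - 1.0); // mutation
    }
    std::size_t worst = 0;
    for (std::size_t j = 1; j < mem.size(); ++j)
        if (fitness(mem[j]) > fitness(mem[worst])) worst = j;
    if (fitness(cand) < fitness(mem[worst])) mem[worst] = cand;
}
```
By construction, the worst fitness in the memory never increases over iterations.
      </p>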
    </sec>
    <sec id="sec-6">
      <title>Simulated Annealing</title>
      <p>Inspired by the physical process of annealing, SA allows
unfavorable decisions when a controlling parameter called
'temperature' is high.</p>
      <p>
        Over the iterations, the temperature is decreased and
the algorithm will asymptotically approach a stochastic hill
climbing. SA
        <xref ref-type="bibr" rid="ref14">(Kirkpatrick et al. 1983)</xref>
        can be implemented
over a population of (1 parent + 1 descendant), using the
uniform mutation presented later in this article.
      </p>
      <sec id="sec-6-1">
        <title>Algorithm 3 Simulated Annealing</title>
        <p>Initialize a random candidate solution V
Set the initial temperature, T = T0
repeat
mutate (perturb) the existing solution, to create V′
compute Δ = f(V′) − f(V)
if Δ &lt; 0 or U(0, 1) &lt; exp(−Δ/T) then
accept the new candidate: V = V′
end if
Reduce T
until termination criterion fulfilled (acceptable solution
found or maximum number of iterations reached)
return V, f(V)</p>
        <p>U(0, 1) denotes a uniform random variable on the
[0, 1] interval.</p>
      </sec>
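      <p>A minimal scalar sketch of this loop, assuming a one-dimensional objective f(x) = x² and illustrative values for the initial temperature, cooling rate, and perturbation size:
```cpp
#include <cassert>
#include <cmath>
#include <random>

// Scalar Simulated Annealing sketch: perturb, accept downhill moves
// always and uphill moves with probability exp(-delta/t), cool t.
double annealMin(double v, double t, int iters, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> U(0.0, 1.0);
    auto f = [](double x) { return x * x; };   // assumed objective
    for (int i = 0; i < iters; ++i) {
        double vNew = v + (U(gen) - 0.5);      // mutate (perturb)
        double delta = f(vNew) - f(v);
        if (delta < 0 || U(gen) < std::exp(-delta / t))
            v = vNew;                           // accept new candidate
        t *= 0.99;                              // reduce temperature
    }
    return v;
}
```
      </p>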
    </sec>
    <sec id="sec-7">
      <title>The Connex-BA1024 chip</title>
      <p>We implement the previous optimization algorithms on the
CA, a massively parallel architecture known as the Connex
BA1024 chip. In this section we briefly introduce some of
the hardware characteristics of BA1024. As a first CA
implementation example, we will describe a random number
generator program. This generator will be used in our
subsequent applications.</p>
      <p>The CA is a Single Instruction Multiple Data (SIMD)
device with 1024 parallel processing elements (PEs), as well
as a sequential unit, which allows general purpose
computations. It contains standard RAM circuitry at the higher
level of the hierarchy, and a specialized memory circuit at
the lower level, the Connex Memory, that allows parallel
search at the memory-cell level and shift operations.</p>
      <p>Several CA chips can be integrated on the same board,
extending the length of processed vectors in increments of
1024, while receiving instructions and data from only one
controller. A controller oversees the exchange of data
between the two levels. Just as in regular memory circuits, the
operations supported by the CA are performed in
well-defined cycles whose duration is determined by the current
memory technology, which today is in the
1.5 ns range.</p>
      <p>The 1024 cells are individually addressable as in a
regular RAM, but can also receive broadcast instructions or data
on which they operate in parallel at a peak rate of 1
operation per cycle. This general concept fits the
Processor-in-Memory paradigm. The cells are connected by a linear
chain network, allowing fast shifting of data between the
cells, as well as the insertion or deletion of data from cells
while maintaining the relative order of all the data. All these
operations are performed in a single memory cycle.</p>
      <p>The hardware characteristics of BA1024 are:
• Memory cycle: 1.5 ns.
• Computation: 400 GOPS at 400 MHz (peak performance)
• External bandwidth: 6.4 GB/sec (peak performance)
• Internal bandwidth: 800 GB/sec (peak performance)
• Power: ≈ 5 W
• Area: ≈ 50 mm² (1024-EU array, including 1 MB of
memory and the two controllers)
• 65 nm implementation</p>
      <p>Using 16-bit arithmetic, the BA1024 computes the
scalar product of two 1024-element vectors in 37.5 ns (26 million
scalar products/sec), and performs a 1024 × 1024 matrix
multiplication in 40 ms (25 operations/sec). Adding up to 1024
numbers is done in 5 cycles; multiplication is done in 10
cycles. The P = 1024 processing elements, each containing
512 registers, are interconnected in a ring. From an
algorithmic point of view, the chip can be considered an array
of P = 1024 columns and M = 512 rows. By convention,
we represent it as an array of horizontal vectors. In C-style
row-major notation, A[i][j] denotes the i-th register inside
the j-th processing element.</p>
      <p>
        An important component of evolutionary algorithms is the
pseudo-random number generator. An ideal random number
generator should be
        <xref ref-type="bibr" rid="ref21">(Quinn 2003)</xref>
        : uniformly distributed,
uncorrelated, cycle-free, satisfy statistical randomness tests,
and reproducible (for debugging purposes). In addition,
parallel generators must provide multiple independent streams
of random numbers. We used the xorshift generator,
introduced by
        <xref ref-type="bibr" rid="ref19">(Marsaglia 2003)</xref>
        , with period 2^128 − 1. The
random seed requires 4 integer vectors X[0], X[1], X[2], X[3] of
1024 elements each. Here is the C++ code of this
pseudo-random generator, using the Vector-C library:
vector&lt;uint&gt; xor128(vector&lt;uint&gt; X[]) {
  vector&lt;uint&gt; T;
  T = X[0] ^ (X[0] &lt;&lt; 11);
  T ^= (T &gt;&gt; 8);
  T ^= X[3] ^ (X[3] &gt;&gt; 19);
  X[0] = X[1];
  X[1] = X[2];
  X[2] = X[3];
  X[3] = T;
  return T;
}
      </p>
      <p>Vectors are represented in uppercase and
initialized with seed values from the host computer (in Linux,
/dev/urandom). It is essential that each component of
the seed vector has a different, independent value. Once
initialized, this function generates 1024 independent
pseudo-random streams.</p>
      <p>On the CA, generating in parallel N &lt;= 1024 uniformly
distributed random numbers results in a linear speedup:
Sxor128 = Tsequential/Tparallel = N , where Tsequential is
sequential execution time and Tparallel is parallel execution
time.</p>
      <p>The randvN(σ) function returns a vector. Each
component of this vector is an independent random variable
with Gaussian distribution, 0 mean and σ standard
deviation. The CA lacks the trigonometric and logarithmic
functions used by the Box-Muller method for generating
normally distributed random numbers. Therefore, we used an
approximation method based on the central limit theorem:
N(0, σ) ≈ σ (Σ_{k=1}^{12} U(0, 1) − 6),
where U(0, 1) is a uniform random variable on the [0, 1] interval.</p>
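      <p>A scalar sketch of this approximation (std::mt19937 stands in for the vectorized xorshift streams; the helper sampleMean is only for checking):
```cpp
#include <cassert>
#include <cmath>
#include <random>

// Central-limit approximation of N(0, sigma): the sum of 12 U(0,1)
// samples has mean 6 and variance 1, so sigma*(sum - 6) approximates
// a normal variate.
double randN(double sigma, std::mt19937& gen) {
    std::uniform_real_distribution<double> U(0.0, 1.0);
    double s = 0.0;
    for (int k = 0; k < 12; ++k) s += U(gen);
    return sigma * (s - 6.0);
}

// Mean of n samples, used to sanity-check the approximation.
double sampleMean(double sigma, int n, unsigned seed) {
    std::mt19937 gen(seed);
    double m = 0.0;
    for (int i = 0; i < n; ++i) m += randN(sigma, gen);
    return m / n;
}
```
      </p>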
    </sec>
    <sec id="sec-8">
      <title>Evolutionary operators on the CA</title>
      <p>We present the building blocks of an evolutionary algorithm
using the CA vector instructions. The control flow of the
algorithm is still sequential, but mutation and evaluation
operators are vectorized. The population is represented as a
matrix. Rows (individuals) are mapped as CA vectors and
use vectorial instructions for mutation, recombination, and
evaluation. A population is evaluated sequentially. The
vector length (max. number of decision variables of the search
space) is limited to 1024, while the population size is
limited by the number of CA rows. Horizontal mapping allows
efficient computation of fitness functions via the parallel CA
reduction operator.</p>
    </sec>
    <sec id="sec-9">
      <title>Recombination</title>
      <p>The recombination operator forms a new individual based
on a set of parents from the existing population. Typically,
the offspring gets a combination of the parents' features.
There are many recombination variants; we present
the ones commonly used in GA and ES: crossover
and discrete recombination.</p>
      <p>Crossover The crossover operation creates a new
individual by combining the features of two parents. In one-point
crossover, elements from the first parent vector are copied
up to a random position. Continuing from that position,
elements from the second parent vector are further copied.
We implement this using a vector selection mask of random
length (Fig. 1).
vector crossover(vector X, vector Y) {
  vector C;
  int position = rand(VECTOR_SIZE);
  where (i &lt; position)
    C = X;
  elsewhere
    C = Y;
  return C;
}</p>
      <p>The rand(n) scalar function returns a random integer
in the range [0, n-1]. The statement where(condition) ...
elsewhere ... is a parallel-if construct available on CA. Index
i denotes the processor element. The expression is evaluated
in parallel on each PEi, and a selection flag (predicate) is
set, which conditions the execution of the statements inside
the where block. The elsewhere block is executed after
the selection predicates are negated. For brevity, we omit
the vector element data type, which can be either integer or
float.</p>
      <p>To obtain a two-point crossover, we need to change the
condition inside where to use 2 parameters, denoting the
start and end splicing points:
where (i &gt;= a &amp;&amp; i &lt; b)
  C = X;
elsewhere
  C = Y;</p>
      <p>
        The above code can be generalized for uniform crossover
        <xref ref-type="bibr" rid="ref22">(Sywerda 1989)</xref>
        . In this case, for each position, a bit is
randomly selected from one of the parents. Uniform crossover
can be implemented by changing the condition to
where (randvb(0.5)) { ... }
where randvb(p) creates a Boolean vector, each bit
having value ’1’ with probability p.
      </p>
      <p>Discrete Recombination In ES, the recombination
operator uses information from ρ individuals. In discrete
recombination, each position of the candidate individual vector v′
is copied from the same position of a randomly chosen
parent: v′(i) = vk(i). The HS algorithm uses this
recombination over the entire population.</p>
      <p>CA supports matrix-vector addressing (selecting a
different cell from each column, to form a new vector), which is
used for discrete-recombination.</p>
      <p>For N &lt;= 1024, the parallel speedup of the two
recombination operators is linear: Scrossover = N .</p>
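      <p>A scalar sketch of discrete recombination, with std::vector standing in for CA vectors (the function name is illustrative):
```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Discrete recombination: position i of the offspring is copied from
// position i of a randomly chosen parent.
std::vector<double> recombineDiscrete(
        const std::vector<std::vector<double>>& parents,
        std::mt19937& gen) {
    std::uniform_int_distribution<std::size_t> pick(0, parents.size() - 1);
    std::vector<double> child(parents[0].size());
    for (std::size_t i = 0; i < child.size(); ++i)
        child[i] = parents[pick(gen)][i];   // choose a parent per position
    return child;
}
```
      </p>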
    </sec>
    <sec id="sec-10">
      <title>Mutation</title>
      <p>Mutation involves changing a single, random position by a
given amount. In horizontal mapping, we first create a
selection mask with a single '1' bit on the k-th position, then
perform a vector + scalar operation, which adds the given
amount only to the element on the k-th position.</p>
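      <p>A scalar sketch of this masked vector + scalar operation (std::vector stands in for the Vector-C types; the name mutate1pos is illustrative):
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-position mutation via a selection mask: only the element
// where the mask holds '1' receives the added amount.
std::vector<int> mutate1pos(std::vector<int> v, std::size_t k, int amount) {
    std::vector<int> mask(v.size(), 0);
    mask[k] = 1;                           // single '1' at position k
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] += mask[i] * amount;          // adds only where the mask is set
    return v;
}
```
      </p>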
      <p>In ES, the mutation operator alters the vector by a
random amount: v′ = v + N (0, σ2), where N (0, σ2) denotes
a random variable with normal distribution. Our Vector-C
function name is randvN(sigma). The σ2 variance
parameter controls the mutation strength:
vector mutateES(vector X) {
  return X + randvN(sigma);
}</p>
      <p>Since the single-bit mutation’s serial execution time is
constant, there is no speedup achieved by parallelization:
Smutate1bit = 1. On the other hand, the speedup for
ESmutation is linear: SmutateES = N , since each vector
element is affected.</p>
    </sec>
    <sec id="sec-11">
      <title>Fitness Function Evaluation</title>
      <p>In evolutionary techniques, evaluating the fitness function
usually consumes most of the time (compared to
mutation and selection), so it is crucial to implement it as
efficiently as possible. The class of functions that can be efficiently
computed using vector instructions on the CA has the form:</p>
      <p>f(x1, x2, ..., xN) = ⊕_{i=1}^{N} hi(xi−k, ..., xi, ..., xi+k)   (1)
where ⊕ is the parallel-reduction operator and k defines a
fixed-size neighborhood (independent of N). Currently,
the CA supports parallel sum reduction. The hi() function
should depend only on the i-th variable and, optionally, on a
small local neighborhood, i − k, ..., i + k. This is due to the
constraint that the processing elements (PEs) are interconnected
by a ring bus, so efficient communication is possible only between
neighboring PEs (data locality).</p>
      <p>
        In
        <xref ref-type="bibr" rid="ref17 ref8">(Mali¸ta &amp; S¸tefan 2009)</xref>
        , it is described how to
compose such a function on the CA by combining data-parallel
and time-parallel computations, as illustrated in Fig. 3.
Given the 'horizontal' mapping of the population in the CA,
after evaluation the fitness value (a scalar) is available to
the sequential unit. The selection operation is not
vectorized; it is done by the sequential unit by comparing or
sorting the scalar fitness values.
      </p>
      <p>Selection in Simulated Annealing To implement SA on
the CA, we use the mutate() and evaluate()
functions already presented. The SA-specific selection
operation (choosing between two solutions Vold, Vnew) is:
vector selectSA(vector Vold, vector Vnew, float t)
{
  float df = evaluate(Vnew) - evaluate(Vold);
  if (df &lt; 0 || randf() &lt; exp(-df / t))
    return Vnew;
  else
    return Vold;
}
The exp(-df/t) scalar function (Boltzmann factor) is
evaluated by the CA's sequential unit. Function randf() returns
a uniform random variable in the [0, 1) interval.</p>
    </sec>
    <sec id="sec-12">
      <title>Experimental Results</title>
      <p>In our experiments, we use two benchmark problems: the
generalized Rosenbrock function and the geometric distance
problem.</p>
    </sec>
    <sec id="sec-13">
      <title>The generalized Rosenbrock function</title>
      <p>
        This is a standard benchmark function used in optimization,
illustrated in Fig. 4. The generalized N -dimensional form
is
        <xref ref-type="bibr" rid="ref9">(De Jong 1975)</xref>
        :
      </p>
      <p>
        f(x) = Σ_{i=1}^{N−1} [(1 − xi)² + 100(xi+1 − xi²)²],  ∀x ∈ R^N   (2)
The geometric distance problem arises in molecular
geometry: given a set of distances between pairs of atoms in
space, determine each atom's (x, y, z) coordinates. Although
various solutions exist, the problem can also be tackled as a
global optimization problem
        <xref ref-type="bibr" rid="ref12">(Grosso, Locatelli, &amp; Schoen 2009)</xref>
        .
We implement a simplified form of this problem, where each
coordinate is assumed to take only discrete values inside a
given bounding rectangle. The aim is to minimize
f(x1, ..., xN) = Σ_{i≠j} (||xi − xj|| − dij)²   (3)
for all (i, j) pairs for which dij is known, where xi ∈
D ⊂ Z³.
      </p>
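      <p>For reference, a sequential C++ implementation of the generalized Rosenbrock function of Eq. (2):
```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Generalized Rosenbrock function, Eq. (2); its global minimum is 0,
// attained at x = (1, 1, ..., 1).
double rosenbrock(const std::vector<double>& x) {
    double sum = 0.0;
    for (std::size_t i = 0; i + 1 < x.size(); ++i) {
        double a = 1.0 - x[i];
        double b = x[i + 1] - x[i] * x[i];
        sum += a * a + 100.0 * b * b;   // (1 - xi)^2 + 100(xi+1 - xi^2)^2
    }
    return sum;
}
```
      </p>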
      <p>To parallelize the evaluation function, we observe that the
list of distances must be distributed across the processing
elements, since the CA does not support random-access
inter-processor communication. The pairs of points for which the
distances are known (as input data) represent the edges of
an undirected graph. We label the edges as e1...eN and the
vertices as x1, ...xV . Each edge is mapped onto its own
processor: ep ⇔ P Ep.</p>
      <p>To compute f (), we need for each pair the xi, xj , dij
variables. The i, j vertex indices for processor p are denoted ip
and jp (p = 1...N).</p>
      <p>Note that some of the vertices must be shared between
processors. To implement this sharing, we use the following
method: Each PE p will hold the distance dp and the vertices
of the two nodes it connects xip , xjp . For example, in a
simple triangle case with three vertices, we have three edges
with labels e0: A - B, e1: B - C, e2: A - C (Fig. 5). To avoid
inter-processor communication during the iterations, since
each PE stores vertex data in private variables, we must
ensure that the variables which represent the same vertex
on different processors have identical values. We do this in
the following way:
1. The vertices are initialized to random values, at the
program initialization.
2. The vertices are distributed to each processor, each
processor stores a private copy.
3. Each vertex xi also has an associated random number
generator stream ri.</p>
      <p>This data representation allows parallel evaluation of the
sum of the distances and parallel mutation of the vertex
coordinates. We present the flowchart of the computation in
Fig. 6.</p>
      <p>For example, to load the graph represented in Fig. 5, we
assign each edge to the corresponding PE. PE0 will
receive the data corresponding to edge 0: the coordinates of
points A, B and the distance d(A,B).</p>
      <p>To evaluate the distances, no inter-processor
communication is required. Each PE computes the distance between
the vertices it holds and subtracts it from the known input
distance. The parallel reduction step computes the sum of
squared differences, resulting in a scalar fitness value:
evaluateDist(vector Xi, Yi, Xj, Yj, D)
{
  vector Dx, Dy;
  Dx = Xi[k] - Xj[k];
  Dy = Yi[k] - Yj[k];
  Dx *= Dx;
  Dy *= Dy;
  Dx += Dy;
  return sumAbsDiff(Dx, D);
}
We measured the number of vector operations for each
evolutionary operator, as well as for some test functions
(see Table 1).</p>
      <sec id="sec-13-1">
        <title>Table 1</title>
        <p>Table 1 (columns: Operation, Tpar, Tseq, S) covers the
operations A += B, xorshift 128, sumAbsDiffs, 1-point
crossover, uniform crossover, uniform mutation, HS mutation,
Rosenbrock, and evaluateDist.</p>
        <p>Tpar is parallel execution time, measured in units of
vectorial operations, Tseq is sequential execution time (number
of sequential operations; we used the instruction count
instead of physical time). The last column contains S, the
speedup Tseq /Tpar, running on N &lt;= 1024 processing
elements. We use a one-to-one data-element-to-PE mapping.</p>
        <p>To accurately interpret these results, we have to
emphasize that we used instruction counts instead of cycle counts
simply because the floating-point version of the chip is still
under development. The results give a theoretical achievable
speedup when using the presented algorithms.</p>
      </sec>
    </sec>
    <sec id="sec-14">
      <title>Conclusions</title>
      <p>The meta-heuristic algorithms presented above are
dependent on the way initial data is organized. We used
horizontal mapping. Another choice is to map the population
vertically, by loading the population data as columns in the
CA. The vectorial instructions will operate in this case over
the corresponding variables of the entire population. By this
transposition, the previous parallel operations will become
serial, and parallelism will operate over the entire
population. However, in vertical mapping we cannot speed up the
evaluation function by using the parallel sum instruction.
Since the evaluation function is the most time-critical
component, we did not explore vertical mapping further to
verify whether it benefits other evolutionary building blocks.</p>
      <p>The CA offers vectorial computational facilities which
are well suited for the implementation of evolutionary
algorithms. We plan to continue our experimental work and
test the efficiency of meta-heuristic optimization, including
on the CA itself (not just on the simulator).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Andonie</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Malița</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>The Connex ArrayTM as a neural network accelerator</article-title>
          .
          <source>In CI '07: Proceedings of the Third IASTED International Conference on Computational Intelligence</source>
          ,
          <fpage>163</fpage>
          -
          <lpage>167</lpage>
          . Anaheim, CA, USA: ACTA Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Back</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fogel</surname>
            ,
            <given-names>D. B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Michalewicz</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , eds.
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>Handbook of Evolutionary Computation</article-title>
          . Bristol, UK: IOP Publishing Ltd.,
          <source>1st edition.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Back</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Fogel</surname>
            ,
            <given-names>D. B.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Michalewicz</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , eds.
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <article-title>Basic Algorithms and Operators</article-title>
          . Bristol, UK: IOP Publishing Ltd.,
          <source>1st edition.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Bäck</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>1996</year>
          .
          <article-title>Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms</article-title>
          . Oxford, UK: Oxford University Press.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>H.-G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schwefel</surname>
            ,
            <given-names>H.-P.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Evolution strategies - A comprehensive introduction</article-title>
          .
          <source>Natural Computing</source>
          <volume>1</volume>
          :
          <fpage>3</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Ştefan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>One-Chip TeraArchitecture</article-title>
          .
          <source>In Proceedings of the 8th Applications and Principles of Information Science Conference</source>
          , Okinawa, Japan.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>De Jong</surname>
            ,
            <given-names>K. A.</given-names>
          </string-name>
          <year>1975</year>
          .
          <article-title>An analysis of the behavior of a class of genetic adaptive systems</article-title>
          .
          <source>Ph.D. Dissertation</source>
          , Ann Arbor, MI, USA.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Geem</surname>
            ,
            <given-names>Z. W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Loganathan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>A New Heuristic Optimization Algorithm: Harmony Search</article-title>
          .
          <source>SIMULATION</source>
          <volume>76</volume>
          (
          <issue>2</issue>
          ):
          <fpage>60</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Grosso</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Locatelli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Schoen</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Solving molecular distance geometry problems by global optimization algorithms</article-title>
          .
          <source>Comput. Optim. Appl</source>
          .
          <volume>43</volume>
          (
          <issue>1</issue>
          ):
          <fpage>23</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Holland</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1975</year>
          .
          <article-title>Adaptation in natural and artificial systems</article-title>
          . University of Michigan Press.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>Kirkpatrick</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gelatt</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          , Jr.; and
          <string-name>
            <surname>Vecchi</surname>
            ,
            <given-names>M. P.</given-names>
          </string-name>
          <year>1983</year>
          .
          <article-title>Optimization by Simulated Annealing</article-title>
          .
          <source>Science</source>
          <volume>220</volume>
          :
          <fpage>671</fpage>
          -
          <lpage>680</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Lőrentz</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Maliţa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Andonie</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>Fitting FFT onto an energy efficient massively parallel architecture</article-title>
          .
          <source>In Proceedings of the Second International Forum on Next-Generation Multicore/Manycore Technologies, IFMT '10</source>
          ,
          <fpage>8:1</fpage>
          -
          <lpage>8:11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Maliţa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ştefan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Integral parallel architecture &amp; Berkeley's Motifs</article-title>
          .
          <source>In ASAP '09: Proceedings of the 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors</source>
          ,
          <fpage>191</fpage>
          -
          <lpage>194</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Maliţa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>The Vector-C library on Connex (A software library for a Connex-like multiprocessing machine)</article-title>
          . http://www.anselm.edu/internet/compsci/Faculty_Staff/mmalita/HOMEPAGE/ResearchS07/WebsiteS07/.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Marsaglia</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Xorshift RNGs</article-title>
          .
          <source>Journal of Statistical Software</source>
          <volume>8</volume>
          (
          <issue>14</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Ştefan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>The CA1024: SoC with integral parallel architecture for HDTV processing</article-title>
          .
          <source>In 4th International System-on-Chip (SoC) Conference &amp; Exhibit, November 1- 2.</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Quinn</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Parallel Programming in C with MPI and OpenMP</article-title>
          .
          McGraw-Hill Education Group.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Sywerda</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>1989</year>
          .
          <article-title>Uniform crossover in genetic algorithms</article-title>
          .
          <source>In Proceedings of the third international conference on Genetic algorithms</source>
          ,
          <fpage>2</fpage>
          -
          <lpage>9</lpage>
          . Morgan Kaufmann Publishers Inc.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Thiebaut</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ştefan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>Ziv-Lempel compression with the Connex Engine</article-title>
          .
          <source>Tech. Rep. 077</source>
          , Dept. Computer Science, Smith College, Northampton, MA, 01063,
          <year>January 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>Thiebaut</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ştefan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <year>2001</year>
          .
          <article-title>Local alignment of DNA sequences with the Connex Engine</article-title>
          .
          <source>In The First Workshop on Algorithms in BioInformatics (WABI 2001)</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <surname>Thiebaut</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Maliţa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Fast polynomial computation on Connex Array</article-title>
          .
          <source>Technical Report 303</source>
          ,
          Smith College
          ,
          <year>November 2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <surname>Thiebaut</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Maliţa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Real-time packet filtering with the Connex Array</article-title>
          .
          <source>In Proceedings of the International Conference on Complex Systems</source>
          ,
          <fpage>501</fpage>
          -
          <lpage>506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Thiebaut</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ştefan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Maliţa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>DNA search and the Connex technology</article-title>
          .
          <source>In Proceedings of the International Multi-Conference on Computing in the Global Information Technology (ICCGI'06).</source>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Weyland</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2010</year>
          .
          <article-title>A rigorous analysis of the harmony search algorithm: How the research community can be misled by a "novel" methodology</article-title>
          .
          <source>Int. J. of Applied Metaheuristic Computing</source>
          <volume>1</volume>
          (
          <issue>2</issue>
          ):
          <fpage>50</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>