Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) A hybrid chemical reaction optimisation algorithm for solving the DNA fragment assembly problem 1st Naima Saidi 2nd Abdesslem Layeb MISC Laboratory MISC Laboratory Constantine 2 University Constantine 2 University Constantine, Algeria Constantine, Algeria naima.saidi@univ-constantine2.dz layeb.univ@gmail.com Abstract—The DNA Fragment Assembly Problem (FAP) is a errors and the difficulties caused by the repetitive structure of combinational optimisation problem in bioinformatics which is genomes. Therefore, metaheuristic approaches are employed to the process of reconstructing the original DNA sequence from find good solutions efficiently. Chemical reaction optimization a set of fragments produced by a sequencing machine. It is an NP-Hard problem. Therefore, finding an exact solution in a (CRO) is a powerful population-based optimization algorithm polynomial-time is impossible. Metaheuristics-based algorithms proposed by [3]. It mimics what happens to molecules in can be used to provide a good solution in reasonable time. In a chemical reaction system microscopically. The CRO is a this paper, we have applied a Chemical Reaction Optimisation discrete metaheuristic which made it suitable for the DNA (CRO) algorithm combined with Simulated Annealing (SA) to fragment assembly problem. It has been successfully applied the DNA fragment assembly problem. The experimental results showed that CRO+SA is very competitive with the state-of-the-art for several combinatorial and real world optimization problems algorithms for this problem. such as : task scheduling in grid computing [4], the 01 Index Terms—Bioinformatics, DNA Fragment Assembly Prob- knapsack problem [5], max flow problem [6], the vehicle lem, Chemical Reaction Optimisation, Simulated Annealing routing problem [7], the energy conserving of sensor nodes in the design of wireless sensor networks [8], clustering I. I NTRODUCTION algorithms for wireless sensor networks [9], and multiple The deoxyribonucleic acid (DNA) is a double stranded helix sequence alignment [10]. that contains genetic information needed for the development In this paper, a chemical reaction optimization algorithm and functioning of almost all cells in a living organism. combined with a simulated annealing-based local search Each strand is constructed from four types of nucleotides: has been proposed to solve the DNA FAP. The simulated Adenine, Cytosine, Guanine, and Thymine. To determine the annealing-based local search have been used to enhance the sequence of these nucleotides, the process of DNA sequencing final solution obtained by the CRO algorithm. We have is applied. Since current DNA sequencing technologies are validated our algorithm by using three set of benchmarks: not able to read the whole DNA sequence, only much shorter Genfrag, Dnagen, and the f-seires. The experimental results fragments called ”reads”, the DNA fragment assembly is show that the algorithm can get better overlap score than other needed to reconstruct the original DNA sequence from these metaheuristics-based approach. reads. The remainder of this paper is organized as follows. In sec- The process of DNA sequencing starts with duplicating tion II, we give some basic concepts about the DNA fragment the original DNA sequence, then each copy is cut into short assembly problem. Section III presents the CRO algorithm. fragments at random points. After that, this biological material The manner of applying the CRO+SA on the DNA fragment is converted to sequences of Ts, Gs, Cs, and As using a assembly problem is detailed in section IV. Section V presents sequencing machine; this process is referred to as the shot- and discusses the experimental results obtained from applying gun sequencing. After the reads are obtained, an assembly the proposed approach on three set of benchmarks. Finally, approach is followed to merge these reads into a longer DNA Section VI concludes the paper. sequence. The main approaches to the DNA fragment assembly II. T HE DNA FRAGMENTS ASSEMBLY PROBLEM problem are: the Overlap-Layout-Consensus (OLC) which is especially used for assembling long reads obtained by the The DNA fragments assembly is one of the most difficult Sanger sequencing or the third generation sequencing, and phases of any DNA sequencing project. Due to the fact the De Brujin graph [1], this approach became popular for that long DNA sequence cannot be accurately and rapidly assembling the short reads produced by the next generation sequenced. DNA sequencing provides the necessary informa- sequencing. tion about the overlap to combine the reads back together. Unfortunately, the DNA fragment assembly is an NP- Therefore, the ultimate goal is to obtain a sequence as close Hard problem [2], even with the elimination of sequencing as possible to the original one [11]. The OLC approach for the DNA fragment assembly prob- and genetic algorithm with hill climbing (GA+HC). The lem proceeds in three phases: authors in [29] have presented a hybrid meta-heuristic based 1) Overlap: It consists in finding the longest common on Simulated Annealing and a genetic crossover operator. In overlapping between the suffix of a sequence and the [30] the authors have proposed a memetic PSO algorithms prefix of another one. This task requires the comparison based on two initialization methods: the Tabu Search (TS) and of all possible pairs of fragments. It is usually tackled simulated annealing (SA). In [31] the authors have proposed with a dynamic programming algorithm applied to semi- a collection of GA variations (Recentering-Restarting, Ring global alignments such as Smith-Waterman algorithm Species, and Island Model) in combination with each other. [12]. These algorithms also used heuristics, namely, 2-Opt and 2) Layout: The goal of this step is ordering the fragments to the Lin-Kernighan Heuristic. They studied the performance maximise the overlap scores calculated in the previous of these algorithms with different solution representations. phase. This NP-Hard problem [2] is the most difficult The most recent metaheuristic used for solving the DNA phase of the OLC approach. FAP is the crow search algorithm in hybridisation with the 3) Consensus: To determine the complete DNA sequence improved PALS (PALS2Many*) in [32]. The authors have using the layout generated in the layout phase. The used a modified version of the ordered crossover operator consensus is usually generated by applying the majority to adapt the continuous crow search algorithm to the DNA rule. For measuring the quality of a consensus, we can FAP; and in the local search they used only the movement use Eq. 1, where n is the number of fragments in that increase the overlap values not the number of contigs. In the target sequence. The coverage measures the data [11] a parallel hybrid algorithm between Particle Swarm Opti- redundancy. mization (PSO) and Differential Evolution (DE), (PPSO+DE) Pn have been proposed. length of the fragment i Coverage = i=1 (1) target sequence length III. T HE CHEMICAL REACTION OPTIMISATION ALGORITHM Since the DNA fragments assembly problem is an NP-Hard CRO is a recent population-based nature inspired meta- problem, Most assemblers are based on variations of a greedy heuristic. It mimics the process of transforming a set of algorithms, such as Phrap [13], TIGR Assembler [14], Celera unstable molecules through a sequence of elementary reactions Assembler [15], Velvet [16], and ABySS [17]. However, most into stable products. Every molecule has a molecular structure proposed metaheuristics dealt with the layout phase of the (w), a potential energy (PE), and a kinetic energy (ke); and OLC approach. Such algorithms based on Simulated annealing optional attributes that depends on the problem. Such as: the [18], tabu search [19], in [20], the authors have proposed a number of hits (NumHit), the minimum PE (MinPE), and the problem aware local search algorithm (PALS) to solve the minimum hit number (MinHit). Illustrations of these attributes problem with the object of achieving a maximum overlap value are presented in Table I . There are four types of reactions in and a minimum number of contigs, which is a sequence of the CRO, grouped into uni-molecular and multi-molecular: fragments with an overlap greater than a threshold between 1) Uni-molecular reactions them. In [21],the authors have proposed two modifications • On-wall ineffective collision to PALS to improve its accuracy and efficiency. In [22], • Decomposition the authors have studied the performance of different genetic algorithm operators for the DNA fragment assembly problem. 2) Inter-molecular reactions They found that the edge-recombination crossover used with • Inter-molecular ineffective collision conjunction with two specialised operators-which manipulate • Synthesis the contigs rather than the individual fragments, perform best. The four elementary reactions are described as follows. Luque and Alba in [18] have proposed a Genetic Algorithm, On-wall ineffective collision: In this reaction, a molecule Scatter Search and CHC algorithm for solving the DNA w hits the wall of the container, then bounces away remaining FAP. Due to its difficulty, scientists have opted to hybrid in one single unit. As a result, a new molecule w0 in the methods to solve this problem. Different hybridisation with neighbourhood of the first one is generated. The change is PALS were proposed in the literature: GA [23], cellular GA allowed only if [24], simulated Annealing [25]. The authors in [26] have proposed a hybrid PSO algorithm (HPSO). They used a a tabu P Ew + KEw ≥ P Ew0 (2) search algorithm for initialising the particels and a simulated we get annealing algorithm to improve the best solution obtained KEw0 = (P Ew + KEw − P Ew0 ) ∗ q (3) by the PSO algorithm. The authors in [27] have presented an Artificial Bee Colony (ABC) algorithm and Queen-bee where q ∈ [KELossRate, 1] , and (1 − q) represents the Evaluation based on Genetic Algorithm (QEGA) to tackle the fraction of KE lost to the environment when it hits the wall. problem of DNA fragment assembly for noisy and noiseless Decomposition: In the decomposition, a molecule hits the instances. In [28] the authors have proposed two algorithms wall and then decomposes into two or more pieces. in this namely genetic algorithm with simulated annealing (GA+SA) paper, we assume that a molecule w produces two molecules 0 TABLE I Let temp2 = (P Ew1 + P Ew2 + KEw1 + KEw2 )(P Ew 1 + T HE ATTRIBUTES OF A MOLECULE USED IN CRO WITH ITS MEANING 0 P Ew2 . Attribute Meaning We get Molecular structure (w) Represents a solution of the problem. 0 KEw 1 = temp2 × q Potential energy (P E) Defines the objective function value of the corresponding solution represented by w. and Kinetic energy (KE) A non-negative number, it quantifies KE 0w2 = temp2 × (1 − q) the tolerance of the system accepting a worse solution than the existing one. Number of hits (numHit) Indicates the total number of collisions where p is a random number uniformly generated from the the molecule has been involved in. interval [0, 1]. Minimum PE (minP E) Means the best objective function value Synthesis: A synthesis happens when multiple (assume that the molecule has experienced. Minimum hit number Records the number of hits when a two) molecules hit against each other and fuse together. We (minHit) molecule obtains its latest minimum generate a new molecule w0 a quite different from the two P E. It is an abstract notation of the original molecules w1 and w2 . The condition for synthesis is period of time a molecule has stayed in a stable state. (KEw1 < β and KEw2 < β). The change is allowed only if P Ew1 + P Ew2 + KEw1 + KEw2 ≥ P Ew0 . (7) w1 and w2 . To perform the decomposition, the criterion We get (N umHit − M inHit > α) has to be fulfilled. The change is allowed if KEw0 = P Ew1 + P Ew2 + KEw1 + KEw2 − P Ew0 . P Ew + KEw ≥ P Ew1 + P Ew2 (4) IV. CRO+SA FOR THE DNA FRAGMENT ASSEMBLY PROBLEM or P Ew + KEw + buf f er ≥ P Ew1 + P Ew2 (5) As previously mentioned, the present work uses a chemical reaction optimisation algorithm combined with a local search if 4 is satisfied we get for solving the layout phase in the OLC approach for the DNA KEw1 = temp1 × q fragment assembly problem. In fact, the use of exact methods for finding the optimal layout would be very time-consuming, and which motivates the use of metaheuristics. In this section, we KEw2 = temp1 × (1 − q) start by defining the aspects of the solution presentation and the objective function, then we present the use of the CRO and where temp1 = P Ew + KEw P Ew1 − P Ew2 and q is the simulated annealing algorithm to tackle the DNA fragment randomly generated from the interval [0, 1]. And if 5 is assembly problem. satisfied we get A. Solution representation KEw1 = (temp1 + buf f er) × q1 × q2 A solution in CRO is presented by the molecular structure and of a molecule. Solutions are represented by a permutation of N (the number of reads) integer encoding a sequence of fragment KEw2 = (temp1 + buf f er − KEw1 ) × q3 × q4 identifiers. Then the buffer is updated by B. Objective function buf f er = buf f er + temp1 + KEw1 + KEw2 Chemical reaction optimisation was originally designed to where q1 , q2 , q3 , and q4 are randomly generated from the tackle minimisation problems. In this paper, we adopt the interval [0, 1]. If 4 and 5 are not satisfied, the decomposition fitness function proposed in [22] by adding a minus sign to it, reaction does not hold and the molecule has its original as shown in Eq. 8 structure w. n−1 X Inter-molecular ineffective collision: In this reaction, two M inimise : F (S) = − w(fj , fj+1 ) (8) or more molecules (assume two) collide with each other j=1 and bounce away. This reaction is similar to the On-wall ineffective collision, we generate two molecule w10 and w20 Where S denotes a molecule, and w(fj , fj+1 ) is the overlap in the neighbourhoods of w1 and w2 respectively. The change score between the two adjacent fragments, calculated using a is allowed if dynamic programming semi-global alignment algorithm. The settings used were 1 for a match, 3 for a mismatch, and 2 for 0 0 P Ew1 + P Ew2 + KEw1 + KEw2 ≥ P Ew1 + P Ew2 . (6) a gap [33]. C. The search process points. Then, we swap all the fragments between the two points between the solutions w1 and w2 as shown In our approach, a chemical reaction optimisation algorithm in Figure 4. is used to search for the best layout that minimises the 4) Synthesis: We have used an enhanced edge recom- objective function Eq. 8. As summarised in the flowchart of bination operator [35]. This operator emphasizes the Fig. 1, the algorithm starts by defining the initial population, adjacency information instead of the order or position of and assigning values to the control parameters. After that, a the fragments in the layout. To create a new solution, we number of iterations are performed. In each iteration, we per- use the information contained in the ” edge table”, which form one of the four collisions of the CRO, then we check for is an adjacency table listing the connections between the any new minimum fitness value and record it. After a certain fragments found in the two molecules. Figure 5 shows number of iterations without improvements, we substitute the a graphical representation of the synthesis collision. worst 20% of the population with new solutions generated by the same method as the initial population. This process helps 3) The simulated annealing: The Simulated Annealing to maintain the diversity of the population. After a maximum (SA) algorithm is a well-known metaheuristic for solving number of iterations performed without improvements, we combinatorial optimisation problem, proposed in [36]. SA is finish the CRO process and use the simulated annealing to a local search algorithm inspired from the cooling process of improve the best solution obtained. More details are presented molten metals through annealing to find the optimal solution. in the following: It has been successfully applied to many difficult optimization 1) Population initialisation: To provide good and diverse problems such as the hybrid vehicle routing problem [37] and solutions, we have combined two strategies to set the values the competitive single and multiple allocation hub location of each solution. We used a greedy-based approach to set problems [38]. In our proposal, the SA starts with the solution the values of 60% of a solution and we assigned the rest provided by the CRO algorithm. It remains for some time randomly from the fragments that had not been selected at the same temperature while a fixed number of iterations before. This approach serves a twofold purpose. Firstly, the is performed. Then, the temperature is cold down. In each greedy approach helps to speed up the search. Secondly, the iteration, we choose one of three operators: random approach prevents the algorithm from stacking at a 1) Inversion: In this operator, we randomly select two local minimum. points. Then, we reverse the order of the fragments 2) Elementary reactions: To choose one of the four elemen- between them as shown in Figure 6. tary reactions, we first decide whether it is a uni-molecular or 2) Specific inversion: we invert the orientation of one an inter-molecular collision. To do this, we generate a random contig. To do this, we randomly select one point in the number t, in the interval of [0, 1]. If t is larger than molecoll, permutation and determine the contig that contains this it will result in an event of uni-molecular collision; Otherwise, fragment. Then, we reverse the order of the fragments an inter-molecular collision will take place. Next, we examine in the selected contig. [22]. Figure 7 show an example the criteria of decomposition or synthesis to decide which type of the specific inversion operator. of collision will take place. 3) Transposition: this operator moves a contig to a position between two adjacent contigs [22]. we randomly select 1) On-wall ineffective collision: Since this reaction is used two points in different contigs to determine the contig to make a small change in the molecular structure, it that will be moved and its new position. Then, we move is done by selecting a fragment that does not have the contig as shown in figure 8. an overlap greater than a threshold with its adjacent fragments, or selecting a random one.To do this, we V. E XPERIMENTAL RESULTS AND DISCUSSION generate a random number t, in the interval of [0, 1]. If t is larger than 0.2, it will result in a random selection. In this section, we evaluate the performance of our algo- Otherwise, we select an isolated fragment. Then, we rithm, implemented in Matlab R2015a and tested on a personal search for the best position for it in the permutation. computer with Intel(R) Core()TM i3-4005U CPU at 1.70GHz Figure 2 shows a graphical representation of on-wall 1.70GHz, 4GB RAM, running on Windows 8.1 64-bit. ineffective collision. We have tested our algorithm through applying it on thirty 2) Decomposition: In this reaction, we have used the half- benchmarks divided into three collections: GenFrag, DNAgen total change operator [34]. we have produced two new and f-series, Table II represents a summary of them. The first solutions from an existing one by keeping one half of column represents the name of the instance, the second column the existing solution for the new one and assigning the contains the original DNA sequence of each benchmark, the remaining half by piking randomly the values form the third column is the length of the original DNA sequence, other half. Figure 3 depicts the decomposition collision. the rest of the columns are the coverage, the mean fragment 3) Inter-molecular ineffective collision: In this operator, we length and the number of fragments. More details on these generate two molecule w10 and w20 in the neighbourhoods benchmarks can be found in [33]. All the benchmarks are of w1 and w2 respectively. to do this, we have used the available for downloading at http://chac.sis.uia.mx/fragbench/ two points crossover. we start by selecting randomly two descargas.php. Table III shows the parameters settings of Fig. 1. The proposed CRO+SA algorithm for the DNA fragment assembly problem TABLE II D ESCRIPTION OF THE THREE SET BENCHMARKS FOR THE DNA FAP USED IN THIS STUDY Instances Original DNA Seq. original Coverage Mean nbr of Seq. length Frag. Frag. length GenFrag instances x60189(4) 4 395 39 A cluster of fibronectin type III repeats X60189(5) 5 286 48 found in the human major 3835 X60189(6) 6 343 66 histocompatibility complex class III region X60189(7) 7 387 68 M15421(5) 5 398 127 M15421(6) A human apolipoprotein B gene 10089 6 350 173 M15421(7) 7 383 177 j02459(7) Complete nucleotide sequence of the cohe- 20000 7 405 352 sive ends of bacteriophage lambda DNA bx842596(4) Neurospora crassa DNA linkage group II 4 708 442 77292 bx842596(7) BAC clone B10K17 7 703 773 DNAgen instances acin1 2170 26 182 307 acin2 147200 3 1002 451 acin3 200741 3 1001 601 a human microbion bacterium ATCC 49176 acin5 329958 2 1003 751 acin7 426840 2 1003 901 acin9 156305 7 1003 1049 f-series f25-305 - - - 305 25 f25-400 - - - 400 25 f25-500 - - - 500 27 f50-315 - - - 315 50 f50-412 - - - 412 50 f50-498 - - - 498 50 f100-307 - - - 307 100 f100-415 - - - 415 100 f100-512 - - - 512 100 f508-354 - - - 354 508 f635-350 - - - 350 635 f737-355 - - - 355 737 f1343-354 - - - 354 1343 f1577-354 - - - 354 1577 Fig. 2. Illustration of on-wall ineffective collision the CRO and SA algorithms. The parameters values were Fig. 3. Illustration of decomposition collision determined empirically by observing the results obtained for different settings. Table IV gives the results of the first version of the CRO in [33]. The solutions that reached the optimal values are and the hybrid CRO+SA. It shows the results of the two presented in bold. The purpose of this experiment is to show algorithms in terms of the best, average, worst and standard the enhancement of the CRO algorithm by a local search such deviation values obtained over 30 independent runs. The as the simulated annealing. As we can see from Table IV, the optimal column presents the optimal fitness values obtained simulated annealing algorithm had clearly improved the results by the LKH algorithm [39], the LKH results were presented obtained by the CRO. These results validate that the use of a Fig. 7. Illustration of specific inversion operator . Fig. 4. Illustration of inter-molecular ineffective collision Fig. 8. Illustration of transposition operator local search with large-scale neighbourhood operators, which manipulate the contgs [22], improves the results. The low rates at which the specific operators are applied is due to the fact that the number of contigs in the solutions obtained by the CRO are low as shown in Table V. Furthermore, the random inversion operator also allows a large scale modification in a solution and prevent the generation of redundant solutions Fig. 5. Illustration of synthesis collision contrary to the specific operators. As mentioned above, Table V gives a summary of the number of contigs obtained by each algorithm. The threshold used to calculate the number of contigs is thirty. Simulated TABLE III T HE PARAMETERS OF THE CRO+SA ALGORITHM Parameter Value CRO parameters size of the initial population 50 initial kinetic energy 1000 α 400 β 10 kelossrate 0.2 molecoll 0.25 SA parameters Initial temperature 50 Inversion rate 0.85 Specific inversion rate 0.06 Fig. 6. Illustration of inversion operator Transposition rate 0.09 TABLE IV T HE RESULTS OF THE CRO AND CRO+SA ALGORITHMS CRO CRO+SA Benchmark Optimal Best Average Worst STD Best Average Worst STD GenFrag x60189(4) 11478 11478 11275.733 10937 188.388 11478 11478 11478 0 X60189(5) 14161 13753 13603.2 13402 90.603 14161 14161 14161 0 X60189(6) 18301 18111 17684.966 17036 267.729 18301 18301 18301 0 X60189(7) 21271 20700 20194.033 19432 311.661 21271 21260.4 21210 19.796 M15421(5) 38746 37975 36395.4 34975 615.427 38746 38740 38706 12.165 M15421(6) 48052 46696 44406.3 42373 1103.849 48052 48052 48052 0 M15421(7) 55171 54019 51316.533 48276 1687.53 55171 55158.233 55130 11.732 j02459(7) 116700 111951 107675.466 97086 3793.992 116700 116677.8 116612 23.649 bx842596(4) 227920 217308 211594.2 191079 7873.797 227914 227682.7 227127 179.206 bx842596(7) 445422 421482 391629.766 373515 15175.729 444518 442822.533 441238 760.515 DNAgen acin1 47618 43127 41679.466 40475 677.354 47613 47496.1 47320 68.017 acin2 151553 146360 143683.233 141490 1061.346 151456 151394.366 151361 22.48 acin3 167877 160182 159036.866 157584 708.128 167228 166980.166 166793 132.64 acin5 163906 157814 156551.3 155153 613.182 163426 163326.4 163247 43.914 acin7 180966 175869 175173.7 174057 08.463 180220 180065.466 179947 62.272 acin9 344107 327919 325452.7 323377 1177.81 343385 343247.5 343093 76.949 f-series f25(305) 596 592 586.333 584 2.07 596 596 596 0 f25 (400) 777 774 766.033 761 2.938 777 777 777 0 f25(500) 921 916 911.2 905 2.773 921 921 921 0 f50(315) 1581 1540 1514.933 1493 10.639 1581 1581 1581 0 f50(412) 1573 1529 1507.233 1486 10.167 1573 1572.2 1570 0.979 f50(498) 1570 1540 1515.333 1498 11.542 1570 1569.466 1568 0.845 f100(307) 2793 2652 2623.5 2589 14.046 2793 2788.433 2785 1.994 f100(415) 2860 2714 2690.5 2659 14.216 2860 2854.933 2851 2.112 f100(512) 2732 2577 2541.766 2508 15.658 2731 2727.6 2723 2.374 f508(354) 18112 16665 16428.166 15979 130.376 18007 17986.333 17948 15.08 f635(350) 22498 20578 20076.966 18880 387.379 22358 22282.166 22240 25.486 f737(355) 25218 23056 22541.9 21766 281.075 24948 24899.566 24865 22.819 f1343(354) 49042 44313 42530.4 40997 849.78 48121 47994.866 47913 51.031 f1577(354) 57373 52131 50165.966 47770 1228.46 56024 55896.7 55731 82.088 annealing had significantly reduced the number of contigs for the GenFrag and DNAgen benchmarks. However, for the f- series benchmarks, the CRO and CRO+SA algorithms have given the same results for the three first small instances. For the next six instances, the two algorithms reached the same best results but the average number of contigs is improved after applying the SA algorithm. For the large instances in this collection, the SA has reduced the number of contigs in the solutions obtained by the CRO algorithm. Table VI compares CRO+SA with the state-of-the-art algo- rithms for the DNA fragment assembly problem that have been applied on the same benchmarks. The algorithms presented in the table are LKH [39], PPSO+DE [11], QEGA [27], SA [40], PALS [20], SAX [29], RRG [31] and CSA+P2M [32]. The optimal solutions are presented in bold. We can see that the CRO+SA has given competitive results with LKH and CSA+P2M algorithms and has outperformed the rest of the algorithms in most instances. Indeed, The statistical Friedman test of Figure 9 represents a comparison of the results of the algorithms presented in table VI for the GenFrag and DNAgen benchmarks. As it is Fig. 9. Friedman test compares the algorithms presented in table VI for the GenFrag and DNAgen benchmarks shown in the Friedman test, there is not a significant difference between the CRO+SA, RRGA, CSA+P2M and the optimal solutions. Furthermore, the results of the hybrid genetic algo- rithm with the SA are close to the CRO+SA results. Figure TABLE V T HE NUMBER OF CONTIGS OBTAINED BY THE CRO AND CRO+SA 10 represents a statistical Friedman test compares the available ALGORITHMS results of the f-series benchmarks. The Friedman test shows that CRO+SA and CSA+P2M algorithms have no significant CRO CRO+SA difference from the best known results. Benchmark Best Average Worst Best Average Worst The experimental results shows that the hybrid CRO algo- GenFrag rithm with the simulated annealing gives good results for the x60189(4) 1 1.8 3 1 1 1 X60189(5) 2 3.333 5 2 2 2 DNA fragment assembly problem. As shown by the Friedman X60189(6) 2 3 5 1 1 1 test, our algorithm ranks third for the three sets of benchmarks. X60189(7) 2 3.366 6 1 1.166 2 M15421(5) 9 14.666 19 3 3.2 4 VI. C ONCLUSION M15421(6) 7 13.166 20 2 2 2 M15421(7) 5 14 27 3 3 3 In this work, we have proposed a hybrid CRO algorithm j02459(7) 16 28.566 57 2 2 2 with simulated annealing for solving the layout phase of the bx842596(4) 33 48.2 71 9 10.333 14 OLC approach for the DNA fragment assembly problem. The bx842596(7) 37 84.233 112 5 9.833 14 DNAgen experimental results show that combining CRO with SA can acin1 28 48.2 61 6 8.766 13 achieve the best overlap scores for sixteen benchmarks out of acin2 118 172.166 189 65 71.1 78 thirty benchmark data sets and very encouraging results for acin3 182 235.066 275 83 92.866 104 acin5 204 254.233 325 107 117.333 131 the rest of them. acin7 245 286.366 339 116 130.733 144 As future work, it may be interesting to fine tuning and acin9 267 370.866 408 77 89.366 105 optimising our approach to improve the results. Since parallel f-series f25(305) 19 19 19 19 19 19 implementation of the CRO can be done easily [34], a parallel f25 (400) 14 14 14 14 14 14 version of the algorithm seems to be a possible option. It is f25(500) 14 14 14 14 14 14 also necessary to develop efficient optimisation algorithms to f50(315) 26 26.466 28 26 26 26 f50(412) 26 26.466 28 26 26 26 deal with noisy instances and real world scenarios. f50(498) 28 28.066 29 28 28 28 f100(307) 69 69.5 71 69 69 69 R EFERENCES f100(415) 65 66.1 69 65 65 65 [1] Pavel A Pevzner, Haixu Tang, and Michael S Waterman. An eulerian f100(512) 69 69.933 71 69 69 69 path approach to dna fragment assembly. Proceedings of the National f508(354) 273 279.433 285 260 260.7 263 Academy of Sciences, 98(17):9748–9753, 2001. f635(350) 357 366 390 332 334.166 337 [2] Pavel Pevzner. Computational molecular biology: an algorithmic ap- f737(355) 428 437.4 455 404 406.233 410 proach. MIT press, 2000. f1343(354) 707 749.033 781 657 662.933 672 [3] Albert YS Lam and Victor OK Li. Chemical-reaction-inspired meta- f1577(354) 856 889.233 936 810 816.966 827 heuristic for optimization. IEEE Transactions on Evolutionary Compu- tation, 14(3):381–399, 2010. [4] Jin Xu, Albert YS Lam, and Victor OK Li. Chemical reaction optimization for task scheduling in grid computing. IEEE Transactions on Parallel and Distributed Systems, 22(10):1624–1631, 2011. [5] Tung Khac Truong, Kenli Li, and Yuming Xu. Chemical reaction optimization with greedy strategy for the 0–1 knapsack problem. Applied Soft Computing, 13(4):1774–1780, 2013. [6] Reham Barham, Ahmad Sharieh, and Azzam Sliet. Chemical reaction optimization for max flow problem. IJACSA) International Journal of Advanced Computer Science and Applications, 7(8), 2016. [7] Min Li, Zhongxiang Liao, Yiming He, Jianxin Wang, Junwei Luo, and Yi Pan. Isea: iterative seed-extension algorithm for de novo assembly using paired-end information and insert size distribution. IEEE/ACM transactions on computational biology and bioinformatics, 14(4):916– 925, 2017. [8] PC Srinivasa Rao and Haider Banka. Novel chemical reaction optimiza- tion based unequal clustering and routing algorithms for wireless sensor networks. Wireless Networks, 23(3):759–778, 2017. [9] PC Srinivasa Rao and Haider Banka. Energy efficient clustering algorithms for wireless sensor networks: novel chemical reaction op- timization approach. Wireless Networks, 23(2):433–452, 2017. [10] Rohit Kumar Yadav and Haider Banka. An improved chemical reaction- based approach for multiple sequence alignment. CURRENT SCIENCE, 112(3):527, 2017. [11] Guillermo M Mallén-Fullerton and Guillermo Fernandez-Anaya. Dna fragment assembly using optimization. In Evolutionary computation (CEC), 2013 IEEE congress on, pages 1570–1577. IEEE, 2013. [12] T Smith and M Waterman. ªidentification of common molecular subsequences. º j. Molecular Biology, 147:195–197, 1981. [13] P Green. Phrap, version 1.090518. Available at h ttp://phrap. org, 2009. Fig. 10. Friedman test compares the LKH, RRGA, CSA+P2M and CRO+SA [14] Granger G Sutton, Owen White, Mark D Adams, and Anthony R algorithms for the f-series benchmarks Kerlavage. Tigr assembler: A new tool for assembling large shotgun sequencing projects. Genome Science and Technology, 1(1):9–19, 1995. TABLE VI A COMPARISON OF THE CRO+SA AND EIGHT STATE - OF - THE - ART ALGORITHMS FOR SOLVING THE DNA FRAGMENT ASSEMBLY PROBLEM . benchmark LKH PPSO+DE QEGA SA PALS SAX RRGA CSA+P2M CRO+SA GenFrag x60189(4) 11478 11478 11476 11478 11478 11478 11478 11478 11478 X60189(5) 14161 13642 14027 14027 14027 14027 14161 14161 14161 X60189(6) 18301 18301 18266 18301 18301 18301 18301 18301 18301 X60189(7) 21271 20921 21208 21271 21210 21268 21228 21271 21271 M15421(5) 38746 38686 38578 38583 38526 38726 38694 38746 38746 M15421(6) 48052 47669 47882 48048 48048 48048 48052 48052 48052 M15421(7) 55171 54891 55020 55048 55067 55072 55071 55171 55171 j02459(7) 116700 114381 116222 116257 115320 115301 116487 116700 116700 bx842596(4) 227920 224797 227252 226538 225782 223029 227238 227920 227914 bx842596(7) 445422 429338 443600 436739 438214 417680 442385 445422 444518 DNAgen acin1 47618 47264 47115 46955 46876 46865 47460 47618 47613 acin2 151553 147429 144133 144705 144634 144567 151353 151545 151456 acin3 167877 163965 156138 156630 156776 155789 167431 167861 167228 acin5 163906 161511 144541 146607 146591 145880 163313 163891 163426 acin7 180966 180052 155322 157984 158004 157032 180177 180924 180220 acin9 344107 335522 322768 324559 325930 314354 343315 344078 343385 f-series f25(305) 596 - - - - - 596 596 596 f25(400) 777 - - - - - 777 777 777 f25(500) 921 - - - - - 921 921 921 f50(315) 1581 - - - - - 1578 1581 1581 f50(412) 1573 - - - - - 1572 1573 1573 f50(498) 1570 - - - - - 1570 1570 1570 f100(307) 2793 - - - - - 2780 2793 2793 f100(415) 2860 - - - - - 2846 2860 2860 f100(512) 2732 - - - - - 2717 2732 2731 f508(354) 18112 - - - - - 17936 18110 18007 f635(350) 22498 - - - - - 22239 22492 22358 f737(355) 25218 - - - - - 24844 25206 24948 f1343(354) 49042 - - - - - 48205 49012 48121 f1577(354) 57373 - - - - - 56300 57313 56024 [15] Eugene W Myers, Granger G Sutton, Art L Delcher, Ian M Dew, Dan P bly problem. In Evolutionary Computation, 2008. CEC 2008.(IEEE Fasulo, Michael J Flanigan, Saul A Kravitz, Clark M Mobarry, Knut HJ World Congress on Computational Intelligence). IEEE Congress on, Reinert, Karin A Remington, et al. A whole-genome assembly of pages 2651–2658. IEEE, 2008. drosophila. Science, 287(5461):2196–2204, 2000. [25] Gabriela Minetti, Guillermo Leguizamón, and Enrique Alba. An [16] Daniel R Zerbino and Ewan Birney. Velvet: algorithms for de novo short improved trajectory-based hybrid metaheuristic applied to the noisy dna read assembly using de bruijn graphs. Genome research, 18(5):821–829, fragment assembly problem. Information Sciences, 277:273–283, 2014. 2008. [26] Ko-Wei Huang, Jui-Le Chen, and Chu-Sing Yang. A hybrid pso- [17] Jared T Simpson, Kim Wong, Shaun D Jackman, Jacqueline E Schein, based algorithm for solving dna fragment assembly problem. In Steven JM Jones, and Inanç Birol. Abyss: a parallel assembler for short Innovations in Bio-Inspired Computing and Applications (IBICA), 2012 read sequence data. Genome research, 19(6):1117–1123, 2009. Third International Conference on, pages 223–228. IEEE, 2012. [18] Gabriel Luque and Enrique Alba. Metaheuristics for the dna fragment [27] Jesun Sahariar Firoz, M Sohel Rahman, and Tanay Kumar Saha. Bee assembly problem. International Journal of Computational Intelligence algorithms for solving dna fragment assembly problem with noisy and Research, 1(1):98–108, 2005. noiseless data. In Proceedings of the 14th annual conference on Genetic [19] Jacek Błażewicz, Piotr Formanowicz, Marta Kasprzak, Wojciech T and evolutionary computation, pages 201–208. ACM, 2012. Markiewicz, and J Wglarz. Tabu search for dna sequencing with [28] Jesun Sahariar Firoz, M Sohel Rahman, and Tanay Kumar Saha. false negatives and false positives. European Journal of Operational Hybrid meta-heuristics for dna fragment assembly problem for noiseless Research, 125(2):257–265, 2000. data. In Informatics, Electronics & Vision (ICIEV), 2012 International [20] Enrique Alba and Gabriel Luque. A new local search algorithm for Conference on, pages 652–656. IEEE, 2012. the dna fragment assembly problem. In European Conference on [29] Gabriela Minetti, Guillermo Leguizamón, and Enrique Alba. Sax: a new Evolutionary Computation in Combinatorial Optimization, pages 1–12. and efficient assembler for solving dna fragment assembly problem. In Springer, 2007. 2012 Argentine Symposium on Artificial Intelligence, 2012. [21] Abdelkamel Ben Ali, Gabriel Luque, Enrique Alba, and Kamal E [30] Ko-Wei Huang, Jui-Le Chen, Chu-Sing Yang, and Chun-Wei Tsai. A Melkemi. An improved problem aware local search algorithm for the dna memetic particle swarm optimization algorithm for solving the dna fragment assembly problem. Soft Computing, 21(7):1709–1720, 2017. fragment assembly problem. Neural Computing and Applications, [22] Rebecca J Parsons, Stephanie Forrest, and Christian Burks. Genetic 26(3):495–506, 2015. algorithms, operators, and dna fragment assembly. Machine Learning, [31] James Alexander Hughes, Sheridan Houghten, and Daniel Ashlock. 21(1-2):11–33, 1995. Restarting and recentering genetic algorithm variations for dna fragment [23] Enrique Alba and Gabriel Luque. A hybrid genetic algorithm for the assembly: The necessity of a multi-strategy approach. Biosystems, dna fragment assembly problem. In Recent advances in evolutionary 150:35–45, 2016. computation for combinatorial optimization, pages 101–112. Springer, [32] Mohcin Allaoui, Belaı̈d Ahiod, and Mohamed El Yafrani. A hybrid 2008. crow search algorithm for solving the dna fragment assembly problem. [24] Bernabé Dorronsoro, Enrique Alba, Gabriel Luque, and Pascal Bouvry. Expert Systems with Applications, 102:44–56, 2018. A self-adaptive cellular memetic algorithm for the dna fragment assem- [33] Guillermo M Mallén-Fullerton, James Alexander Hughes, Sheridan Houghten, and Guillermo Fernández-Anaya. Benchmark datasets for the dna fragment assembly problem. International Journal of Bio-Inspired Computation, 5(6):384–394, 2013. [34] Albert YS Lam and Victor OK Li. Chemical reaction optimization: A tutorial. Memetic Computing, 4(1):3–17, 2012. [35] Timothy Starkweather, S McDaniel, Keith E Mathias, L Darrell Whitley, and C Whitley. A comparison of genetic sequencing operators. In ICGA, pages 69–76, 1991. [36] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing. science, 220(4598):671–680, 1983. [37] F Yu Vincent, AAN Perwira Redi, Yosi Agustina Hidayat, and Ok- taviyanto Jimat Wibowo. A simulated annealing heuristic for the hybrid vehicle routing problem. Applied Soft Computing, 53:119–132, 2017. [38] Nader Ghaffarinasab, Alireza Motallebzadeh, Younis Jabarzadeh, and Bahar Y Kara. Efficient simulated annealing based solution approaches to the competitive single and multiple allocation hub location problems. Computers & Operations Research, 90:173–192, 2018. [39] Keld Helsgaun. An effective implementation of the lin–kernighan traveling salesman heuristic. European Journal of Operational Research, 126(1):106–130, 2000. [40] Gabriela Minetti and Enrique Alba. Metaheuristic assemblers of dna strands: Noiseless and noisy cases. In Evolutionary Computation (CEC), 2010 IEEE Congress on, pages 1–8. IEEE, 2010.