1. Introduction and motivation

An approach to analysis of the similarity ofDNA-sequences

B F Melnikov

A Trenina

1 0 Russian State Social University , Wilhelm Pieck str. 4, Moscow, Russia, 129226 1 Togliatti State University , Belorusskaya str. 14, Togliatti, Russia, 445020

2018

63 72

This paper describes algorithms, corresponding computer programs and the results of computations, supplementing results published earlier. We consider the multiplesequence alignment problem, which can be nominated by a central problem in computatiobniaollogy. For it, we continue to consider some di erent versions of so-called “triangular norm” defined on the set of triangles formed by the dierent distance between genomes computed by di erent algorithms. Besides, one of the problems considered in biocybernetics is the problem roefconstructing the distance matrix between DNA sequences, when not all the elements of the matrix under consideration are known at the input of thealgorithm. In this connection, the problem arises that the developed method of comparative evaluation of algorithms for calculating the distances between sequences should be used for another problem, i.e., for reconstructing the matrix of distances between DNA sequences. In this paper, we consider the possibility of applying themethod of comparative evaluation of the algorithms for calculating the distances betweena pair of DNA strings that we developed and studied earlier for the reconstruction ofa partially filled distance matrix. The restoration of the matrix occurs as a result of several computationapl asses. Estimates of unknown matrix elements are averaged in a special way using so-called fruisnkctions, and the result ofthis averaging is considered as the received value of the unknown element.

1. Introduction and motivation

In this paper, we describe the continuation of research on the issues that we started in [ 1, 2, 3, 4, 5 ]. We describe algorithms, corresponding computer programs and the results of computations, supplementing results published earlier. We consider the multiple sequence alignment problem, which can be nominated by a central problem in computational biology. For it, we continue to consider some di erent versions of so-called \triangular norm". (The name \triangular metric", also sometimes encountered in our previous papers, is also quite possible and does not contain errors, we will not give detailed comments on this thing. However, the \norm" in our case is some more correct, both when we speak about the whole matrix, and when we speak only about the only triangle.) Such norm de ned on the set of triangles formed by the di erent distance between genomes computed by di erent algorithms. Thus, in this paper the word \metric" will be used only as the distance between the genomes, and the word \norm" as an indicator of the badness of a certain set of such distances.

In previous cited papers, we described Panin's metric and some algorithms for its improvement. In this paper, we consider some variants of investigation of various variants of the triangular norm. Let us remark, that both these directions are interrelated. Namely, we improve the interpretation of the Panin's metric in the same way as it is done in some interpretations of genetic algorithms: we are trying to achieve a combination of parameters, for which the triangular norm reaches a minimum value (or is close to it).

Considering this second problem, i.e., the study of variants of the triangular norm, we proceed as follows. We consider incorrect variants of obtaining the triangle inequality, which for matrices of the order of about 50 50 is violated in the two most successful metrics (including the Panin's metric earlier developed by us) in less than 1% of cases. It is important to note that in order to improve the decrease in the quantitative index of the badness of the entire matrix of considerations, we also, rst of all, consider triangles in which the triangle inequality is not violated, but in which the badness value is relatively large. Possible improvements are related to the use of neural networks that were not used by us in previous calculations. In this case, neural networks solve the inverse problem: we improve (reduce) the overall badness of the matrix of distances between genomes, forcibly changing the previously obtained distances; further, we try to re ect these forced changes in the original algorithms for calculating distances.

Thus, to determine the distance between genomes, we need heuristic algorithms; and, if possible, they should not require too much time. There are various such algorithms, but their obvious disadvantage is getting a few di ering results when using di erent heuristic algorithms applied to the distance calculation between the same pair of DNA strings. Therefore, the problem of quality evaluation of the metrics used (distances) arises; and based on the results obtained in solving this problem, one can draw conclusions about the applicability of a particular algorithm for calculating distances to various applied studies. A possible approach to determining the quality of metrics was given in [ 4 ]; also partially we consider these approaches below in this paper.

However, we used the same approach for a completely di erent problem; it is as follows. Even heuristic algorithms require often large time-consuming costs: for example, to construct a matrix of the order of 50 50 into which the distances computed by the Needleman { Wunsch algorithm ([ 6 ] etc.) are recorded, it takes about 28 hours (at a processor clock frequency of about 2 GHz, see [ 4 ]). Therefore, one of the problems considered in the biocybernetics is the problem of restoration of the distance matrix between DNA sequences (below we simply shall write \DNA matrix"), in which not all elements of the matrix under consideration are known at the input of the algorithm, see [ 7, 8 ], compare also [9, 1I0n].connection with all this, another problem arises: to use the developed method of comparative evaluation of algorithms for calculating distances between sequences for a completely di erent purpose, namely, for the brie y described problems of reconstructing the distance matrix between DNA sequences. For this problem, in this article we consider the application of the previously developed method of comparative evaluation ofdistance calculation algorithms between a pair of DNA strings.

Using this approach (i.e., using the method of comparative evaluation of algorithms for calculating distances to matrix reconstruction), the reconstruction itself occurs as a result of several computational passes. On each of the passes, for some of the as-yet-un lled (unknown) elements of the matrix, di erent estimates are obtained; these estimates are averaged in a special way-and the result of averaging is taken as the value of the unknown element. From the physical point of view, the applied averaging gives the position of the center of gravity of a onedimensional system of bodies whose mass is speci ed by a special function, i.e., the risk function, see [ 11, 12 ]. We note that earlier we used risk functions in completely di erent subject areas; these areas were always connected with auxiliary algorithms related to multicriteria optimization.

Below, the matrices to be reconstructed are referred to as incompletely lled with distance matrices. We enter this term for the matrix, from which a number of elements are \crossed out".

2. Preliminaries

Thus, like our previous papers, we consider the square symmetric matrix of distances between genomes. There it is necessary to note the following. First, the genomes we choose random enough and took them o the site [ 13 ]. Second, like previous works, we consider in fact three variants for each of the considered problems: for very distant species, including, for example, a mammal\Bison bison" and a reptilia \Apalone spinifera" (we use the o cial scienti c Latin names), see detailed the species' list by the link for direct download [ 14 ]; for a su ciently close species (human and apes); and also for human races (in fact, they can be considered as subspecies of a biological species).

Let us note, looking ahead, we believe that our a theoretical construction is best applicable for more distant species, however, acceptable results are obtained also in two other cases.

Thus, we look at algorithms of comparing the quality of di erent algorithms that calculate the distance between two genomes. Apparently, all these algorithms are based on the use of various versions of the Levenshtein distance (or Levenshtein metric), see [ 15 ] and very many other following papers. It is very important to note, that for the simplest formulation of the problem (to make the strict calculating the value of Levenshtein metric for two given genomes), we unfortunately obtain a very long-running algorithm (program): it has to do with the actual length of the strings of genomes. Therefore, in each of the actual algorithms (see [ 16, 17, 18, 19 ] etc.), computation of distances between genomes in reality is a heuristic extension of the exact algorithm for calculating the Levenshtein distance. And apparently, the approach closest to our one is given by [ 20 ]; let us note, a little running forward, that it also uses a version of the branch-and-bound method.

Thus, the considered in our previous papers Panin's metric is no exclusion, It also is a heuristic algorithm for calculating the close version of such metric; it is an optimization problems, see [ 21 ] etc. However, we used in it a special approach (so called multi-heuristic approach for discrete optimization problems), and for it, we use the same heuristic as in very di erent problems. From many such problems, let us mention two ones only: the classical traveling salesman problem (however, we consider our own approach to this problem, and, most importantly, our original way of specifying the input data, di erent from the traditional geometric placement, see, e.g., [ 22 ]); and the problem of state-minimization for nondeterministic nite automata, see, e.g.,[ 23, 24 ].

3. The triangular norms: their study and possible attempts of improvement

It was justi ed in our previous works cited above, there is desirable that in the matrix of distances between genomes, any of the resulting triangles be close to an acute angled isosceles one with two angles exceeding 60 degrees. Several various empirically selected numerical characteristics describing such di erences are also given in our works, see [ 4 ] etc. However, in the previous articles we did not consider detailed examples, let us consider them in this section.

In [ 14 ], the results of calculations for several metrics and several norms are presented. In this section, we shall consider the Panin's metric only, and 3 norms (\badnesses" for the triangle under consideration). Thus, for each triangle with the sides a b c and the angles we considered the following norms: bad1 = ( )= ; bad2 = ( )= ; bad3 = (a b)=a : In case if 90o, we have considered each norm by the maximum possible (1:0) or even usually exceeding this value. We assigned an even greater value to the value of badness in the case when the three considered sides do not form a triangle at all (that is, they do not satisfy the triangle inequality); let us note, running ahead, that similar situations is happened for any of metrics considered by us.

The resulting value of the norm of the whole matrix was considered as the arithmetic mean of the norms of all triangles. We note that for the matrices of distances between genomes (usually from 30 30 to 50 50), the number of triangles is 30 29 28 2 3 = 4060 for dimensions 30, and 19600 for dimension 50; from these values, it is clear that the calculations we need are quite di cult.

Thus, let us consider the part of the table given on the page titled \Panin's metrics" of [ 14 ] (they are designed as an xlsx- le and are available for the direct download), see the table on Fig. 1. (The names of the considered species can be found there on the page titled \Types ofanimals".)

Let us choose species with the numbers 5, 15, 25; while doing so, we speci cally chose exactly the three of those considered, where the metric gives rather poor results (that means the following: the badnesses of Panin's metric give relatively worse results than other metrics comparing most other triangles of the matrix under consideration). For all 5 considered metrics, we obtain the following 5 triangles corresponding to the species with the numbers 5, 15, 25: 1) sides 3533, 3142, and 2940 (Panin's metric); 2) sides 1215, 1179, and 704 (van der Loo's 1st metric); 3) sides 4379, 4036, and 4029 (van der Loo's 2nd metric); 4) sides 2943, 2674, and 2492 (Pages' 1st metric); 5) sides 3444, 3046, and 2838 (Pages' 2nd metric) (here, the numbers correspond to numbers of metrics of [ 4, 14 ]). Let us consider these 5 triangles and also 3 other ones, see Fig. 2:

Let us note once again, that we only need the relative lengths of the sides of the triangle.

The further calculations yield the following auxiliary values and values of badnesses, see Table 1.

No. a 1 3533 2 1215 3 4379 4 2943 5 3444 6 5 7 7 8 10

We can see, that three bad orderings for all ve triangles give the same sequence of metrics. Once again, we mention (above, this thing has already been said, but in connection with other facts) that this ordering di ers signi cantly from the ordering of the badness considered for complete matrices, see the results of the calculations below and, in more detail, in [ 4 ]. However, in all our calculations (in any case, with the exception of less than 1% of all triangles, i.e., including ordering for complete matrices), such ordering turns out to be the same for all three norms (badnesses).

Another option for investigating the comparative characteristics of norms is the following one (we will continue to consider 30 species, and the matrix of distance between genomes, given in [ 14 ]). We choose two metrics and for any one xed norm we arrange 4060 triangles in order of increasing values of this norm. At the same time, when reading that both these norms are good (that is, they give acceptable results), we should ideally obtain identical sequences of triangles. Actually, one of these two sequences of triangles is obtained from the other by some sequence transpositions of neighboring elements. Since, as we have noted, the number of triangles in our case is 4060, then the maximum possible number of transpositions of neighboring elements is equal to 4060 4059

= 8 239 770: 2 Let us give concrete results of work (Table 2) for the rst norm (value \bad1") only.

Let us give some comments to this table. The pair of metrics is selected in the rst line. The number of transpositions of neighboring elements (for obtaining the monotone sequence) is given in the second line. The percent of the maximum possible number of transpositions of neighboring elements (8 239 770) is given in the third line. We do not use Spearman's rank correlation coe cient (and some others correlation coe cients used in similar problems); instead of them, we use the linear function having value 1 for 0 transpositions and value 1 for 8 239 770 transpositions.

Still, we note that the two other norms give somewhat worse results; but for them, the number of transpositions does not exceed 600 000 (i.e., less than 7:5%), and, therefore, calculated by our method rank correlation coe cient is more than 0:85.

Thus, all three norms almost always give similar results. Therefore we often use the singular for the word \norm": \some di erent versions of so-called triangular norm" etc.

4. About one method of DNA matrix reconstruction

In this section, one of the methods of comparative analysis of various algorithms for calculating distances between DNA sequences is presented, and on the basis of this, a method for reconstructing an incompletely lled matrix is developed. In order to carry out this comparative analysis, we propose for the resulting algorithm to calculate the distances between genomes, consider all possible triangles, because ideally they should be acute-angled isosceles.

To answer the question of how \correct" is the matrix obtained as a result of some heuristic algorithm, we propose to use the \characteristic of the departure" of the obtained triangles from \elongated isosceles" triangles; i.e., the \badness", below we shall write this term without quotes. In this case, as one of the variants of the badness, the following formula can be used: = (1) where , and are the angles, and we admit, that [ 4 ]. In the opinion of the authors of this paper, this formula best characterizes the requirements described by us. Here is the informal explanation for this: The closer the triangle to the isosceles triangle, the less the di erence between and , and in the ideal case, the numerator is 0; in accordance with our assumptions, an obtuse (or rectangular) isosceles triangle cannot be obtained. The performance of the properties of acute angles increases the denominator. Consequently, the approximation of a triangle to an isosceles triangle reduces the numerator and increases the denominator in the formula, i.e., tends to 0.

When calculating the badness of the entire matrix for each recovery option, we can: either summing the corresponding badness over all possible triangles of the matrices in question;

or take the maximum badness for these triangles.

In the future, we propose to consider other approaches to calculating the badness of the entire matrix.

To determine the unknown element, we consider all the possible triangles formed from elements of this matrix for which one of the sides is unknown. For each such triangle, from the condition that it is isosceles acute-angled, we get one of the possible values of this unknown side. Next, we calculate the nal value of this side (an unknown element) in a special way. Namely, for its calculation, on the basis of all the estimates obtained, the element is assumed to be equal to the arithmetic mean of all the values obtained. As an alternative, we can exclude the largest and smallest of the values obtained.

With a large number of missing elements, the matrix of the triangles with two known sides will be small, so the restoration of the matrix in one pass is usually impossible. When restoring the matrix on the second and subsequent passages, we can either use only the elements of the matrix of the last pass, or use all the matrices obtained in the previous passes. In the second case, with each successive passage in the matrix, there are more and more elements calculated approximately. Therefore, when evaluating an unknown element, it is possible to use the analog of the risk function, which will adjust the weight of the elements depending on the pass number.

When using the so-called static risk function, the weight of the elements with each pass decreases with the same coe cient, and to estimate the unknown element of the matrix, formula E = c0E0 + c1E1 + : : : + ckEk c0 + c1 + : : : + ck ;

Ei is the value of the matrix element, calculated on i-th pass;

c0 ; : : : ; ck are some specially chosen coe cients.

In practice (see [ 5 ]), good results are achieved when the following formulas are used for the coe cients:

By [ 11, 12 ], the risk function can be also dynamic: when using it, we take averaging, depending on the \rough estimate" of the nal value: whether it is \good", \middle" or \bad". Besides, we can also consider the sequence of such dynamic risk functions, where at each stage we rely on the value obtained in the previous step as such a \rough estimate". In our case, to evaluate the unknown element of the matrix of distances between DNA strings, the formula is used; f (x) is some specially chosen decreasing function.

5. The formal description of the algorithm

Let us consider the detailed formal description of the algorithm brie y considered before.

Algorithm 1 (Restoring a matrix using a static risk function)

Input: Incompletely de ned matrix A = aij (all elements equal to zero outside the main diagonal are assumed to be unknown). Used auxiliary variables: bi is the array of unknown element estimates.

Description of the algorithm.

Step 1: We set s := 1 (the number of the pass).

Step 2: We count h, i.e. the number of elements of the upper triangle, which are equal to 0. Step 3: if aij = 0 and i 6= j then begin kol := 0 fWe count the number of triangles,

built on the unknown element under considerationg for k := 0 to n do begin if k 6= i and k 6= j and aki 6= 0 and akj 6= 0 then begin kol := kol + 1; c0 := 1; cs := cs 1 p;

Eki := Ekj := c0Ek0i + : : : + csEksi ;

c0 + : : : + cs c0Ek0j + : : : + csEksj ;

c0 + : : : + cs end;

if Eki > Ekj then bkol := Eki else bkol := Ekj end; end; aij := b1 + : : : + bkol :

kol Step 4: We count h1, i.e., the number of elements of the upper triangle, which are equal to 0 after the next pass.

Step 5:

if h1 = 0 then goto Output 1; if h1 = h then goto Output 2; s := s + 1; goto Step 2.

Output 1: Filled matrix A.

Output 2: Matrix A cannot be lled.

End of description of the algorithm.

After execution of the algorithm for performing a comparative analysis of the results of the reconstruction of the matrix, we use such an indicator as the residual; it characterizes the deviation of the resulting matrix from the original one. We calculate the residual on the basis of the natural metric d = qPn 1 Pn i=1 j=i+1(aij n(n 1)=2 aij )2 f ; aij are elements of the matrix obtained as a result of applying some algorithm for calculating f the distances between a pair of genomes (in our case, the Needleman { Wunsch algorithm); aij are elements of the matrix restored as a result of the above algorithm.

Some examples of operation of the algorithm and corresponding values of residual are given in [ 5 ].

6. Conclusion

Thus, a very small change of the given matrix greatly reduces its badness. The values marked in red are chosen by us manually. However, at the present time we have a neural network that implements such an algorithm; we are going to describe this neural network in the following publication. But the following is much more important: these \red" changes give input information to another neural network, the one that computes the constants for the metric. Thus, along with one \loop" of algorithms already described in previous section, we get one more. Table 3 above already takes into account similar improvements in the metric.

As we said before, the received results of our computer programs are designed as an xlsx- le and are available for the direct download by [ 14 ]. The whole article i s actually a comment to this le.

We also note that di erent \places" of di erent metrics for di erent cases (i.e., the case of distant animal species, the case of close animal species and the case of subspecies) talk about the need to continue research in this direction.

Also in the near future we expect to develop other approaches for comparative analysis of various algorithms for calculating distances between sequences, as well as describe the algorithms for reconstructing matrices based on these approaches. At present, we are working on comparing two of such algorithms, both for application in the \normal" problems of DNA analysis, and in the problems of DNA matrix reconstruction close to those considered in the present paper. Acknowledgements The authors of the article express their gratitude to Vladislav Dudnikov (Togliatti State University, Russia) for his help in preparing this paper. The reported study was partially supported by RFBR according to the research project No. 16-47-630829.

[1] Melnikov

and Panin

A 2012

a parallel implementation of the multi-heuristic approach in the problem of comparison of genetic sequences

Vektor Nauki of Togliatti State University 4 ( 22 ) 83 - 86 (in Russian)

[2] Makarkin

, Melnikov

and Panin A 2013 A parallel implementation of the multi-heuristic approach in the task of comparing genetic sequences Applied Mathematics 4 ( 10 ) 35 - 39 DOI: 10.4236/ am. 2013 .410A1006

[3] Melnikov

, Pivneva

and Trifonov

M 2017

Multiheuristic approach to compare the quality of defined metrics on the set of DNA sequences Modern Information Technologies and IT Education 13( 2 ) 89 - 96 (in Russian) DOI: 10 .25559/SITITO. 2017 . 2 . 235

[4] Melnikov

, Pivneva

and Trifonov

2017Various algorithms, calculating distances of DNA sequences, and some computational recommendationsfor use such algorithms CEUR Workshop Proceedings 1902 43 - 50

[5] Melnikov

and Trenina M 2018 On

a problem of the reconstruction of distance matrices between DNA sequences

International Journal of Open Information Technologies 6 1 - 13 (in Russian)

[6] Needleman

and Wunsch Ch 1970 A generalmethod applicable to the search for similarities in the amino acid sequence of two proteins Journal of Molecular Biology 48 ( 3 ) 443 - 453

[7] Eckes

, NischtR and Krieg T 2010 Cell -matrix interactions in dermal repair and scarring Fibrogenesis Tissue Repair 3 ( 4 ) 1 - 13 DOI: 10.1186/ 1755 -1536-3-4

[8] Midwood

K S

, Williams L Vand Schwarzbauer J E 2004

Tissue repair and the dynamics of the extracellular matrix

International Journal of Biochemist&ry Cell Biology 36 ( 6 ) 1031 - 1037

[9] Evdokimova

N I

and Kuznetsov

A V

2017

Local patterns in the copy-move detection problem solution

Computer Optics 41 ( 1 ) 79 - 87 DOI: 10.18287/ 2412 -6179-2017-41-1- 79 -87

[10] Evsutin

O O

, Shelupanov

A A

, Meshcheryakov

R V

and Bondarenko

D O

2017

An algorithm for information embedding into compressed digital images based on replacement procedures with use of optimization

Computer Optics 41 ( 3 ) 412 - 421 DOI: 10.18287/ 2412 -6179-2017-41-3- 412 -421

[11] Melnikov

and Radionov A 1998 On the choice of strategy in nondeterministic antagonistic games Programming and Computer Software 5 55 - 62 (in Russian)

[12] Melnikov

B 2001

Heuristics in programming of nondeterministic games Programming and Computer Software 5 277 - 288

[13] Nucleotide ( The Nucleotide database) (Access mode : http://www.ncbi.nlm.nih.gov/nuccore)

[14] Melnikov B The processed results of the computer calculations (Access mode: http://bormel .ru/ BorMel-DNA.xlsx)

[15] Levenshtein

V 1966

Binary codes capable of correcting deletions, insertions , and reversals Soviet Physics Doklady10 ( 8 ) 707 - 710

[16] Winkler

W 1990

String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage Proceedings of the survey research methods sections , American Statistical Association 4 ( 22 ) 354 - 359

[17] Pages

, Aboyoun

, Gentleman

and DebRaoy S Biostrings : String objects representing biological sequences, and matching algorithms (Access mode: https://rdrr .io/bioc/Biostrings/)

[18] Morgan

, Anders Sand Lawrence M ShortRead: a bioconductor package for input, quality assessment and exploration of high-throughput sequence data (Access mode: https:// www .ncbi.nlm.nih.gov/pmc/articles/PMC2752612/)

[19] Van der Loo M 2014 The Stringdist

Package for Approximate String Matching

R Journal 6 111 - 122

[20] Althaus

, Caprara

, LenhofH-P and Reinert

K 2006

A branch-and-cut algorithm for multiple sequence alignment

Mathematical Programming 105 387 - 425

[21] Melnikov

B 2006

Multiheuristic approach to discrete optimization problems Cybernetics and Systems Analysis 42 ( 3 ) 335 - 41

[22] Makarkin

and Melnikov

B 2013

Geometrical methods for solving the pseudo-geometric version of the traveling salesman problem Stochastic optimization in informatics 9(2 ) 54 - 72 (in Russian)

[23] Melnikov

B 2000

Once more about the state-minimization of the nondeterministic finite automata

Journal of Applied Mathematics and Computing 7 ( 3 ) 655 - 662

[24] Melnikov

and Tsyganov

A 2012

The state minimization problem for nondeterministic finite automata: the parallel implementation of the truncated branchand bound method Proceedings 5th International Symposium on Parallel Architectures, Algorithms and Programming (Taipei) 194 - 201