<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>American Statistical Association 1990: 354-359.
[5] Van der Loo MPJ. The Stringdist Package for Approximate String Matching. The R Journal 2014</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1186/1755-1536-3</article-id>
      <title-group>
        <article-title>Various algorithms, calculating distances of DNA sequences, and some computational recommendations for use such algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>B.F. Melnikov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>S.V. Pivneva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M.A.Trifonov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center of Information Technologies and Systems for Executive Power Authorities</institution>
          ,
          <addr-line>19, str. 1, Presnenskiy Val, 123557, Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Togliatti State University</institution>
          ,
          <addr-line>14, Belorusskia st., 445020, Togliatti</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2005</year>
      </pub-date>
      <volume>2014</volume>
      <fpage>354</fpage>
      <lpage>359</lpage>
      <abstract>
        <p>In recent years, various approaches have been described to determine the similarity of the DNA sequences; each of which defines a metric for the set of DNA sequences. In this paper, we propose a new approach to solving this problem; moreover, algorithms for its implementation are based on multiheuristic approach to the discrete optimization problems previously developed by us. However, the main focus of this article is to describe our original approach to compare the quality of defined metrics on the set of DNA sequences. The last approach is based on the fact, that the triples of distances between genomes should ideally form isosceles acute triangles. On the basis of this assumption, we proposed value of the norm, gives in practice acceptable results; the validity of this approach is also discussed in the article. In the course of work on the implementation of algorithms have been carried out computational experiments with 100 DNA of “distant” species, as well as with representatives of several genomes of great apes and humans. Several possible standards defined comparative quality algorithms describing metric distances on DNA sequences. Thus, the main focus of this article is to describe our original approach to compare the quality of defined metrics on the set of DNA sequences. The approach is based on the fact, that the triples of distances between genomes should ideally form isosceles acute triangles. On the basis of this assumption, we proposed value of the norm, gives in practice aссeptable results. In the course of work on the implementation of algorithms have been carried out computational experiments with 100 DNA of “distant” species, as well as with representatives of several genomes of great apes and humans.</p>
      </abstract>
      <kwd-group>
        <kwd>metric evaluation</kwd>
        <kwd>algorithms</kwd>
        <kwd>multiheuristic approach</kwd>
        <kwd>original approach to compare the quality of defined metrics on the set of DNA</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The problem of determining the similarity of DNA is a special case of non-exact matching sequences[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The “non-exacting”
(“mistaking”) is that when comparing lines it is possible to identify similar sequences, despite the errors and distortions in them,
for example, changing, deleting, or inserting some characters. The amount of such distortion sets metric on the set of rows,
which is determined by the minimum number of edit operations that provide a single line of another. This problem occurs in
many areas. For example, a comparison of the genes and chromosomes of proteins is a major challenge and one of the main tools
of molecular biology and bioinformatics [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1,2,3,4,5,6,7</xref>
        ]. The exact comparisons of nucleotide chains (and also computing
distances using such comparisons) are unacceptable because of errors in the data, and due to possible mutations. Inaccurate
mapping is carried out like the text processing. One of the metrics obtained by comparing the words (Levenshtein distance) is
used to correct errors, to enhance recognition of scanned documents, to search in the information systems and databases [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].To
find an approximate solution, there are different algorithms in different subject areas, for example, to search a database of
genetic information is widely used algorithm BLAST ([
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]etc.), approximating Needleman-Wunsch algorithm.
      </p>
      <p>Thus, in Section 2 of this paper we describe the application to the defining similarity of DNA sequencesso called
multiheuristic approach [8,9], which is in fact the big extension of the branch and bound method. Note that previously, before
our works, branch and bound method, apparently, was not used to solve this problem.</p>
      <p>
        Thus, the calculation of the distance (metric) between the rows of DNA of different species of organisms is one of the most
important tasks of modern bioinformatics. As already noted, today there are many algorithms allowing to make an approximate
calculation of the polynomial time ([
        <xref ref-type="bibr" rid="ref4">4,5,6,7,10</xref>
        ] and many others). The obvious disadvantage when calculating the distance
between the one and the few lines of DNA is to provide different results when using different algorithms to calculate metrics.
However, the authors do not know the work, which compared to a variety of algorithms for solving this problem. In this regard,
one of the tasks that are discussed in this article was to develop a method for the comparative evaluation of such algorithms;
moreover, this problem seems to be the most important our consideration. As a result, we have proposed a method of evaluation
using the properties of an isosceles triangle in a metric space (Section 3, so called “triangular norm”, we calculate using it so
called badness, related to some metric for several species).
      </p>
      <p>We are also looking at options to improve some existing metrics. In this case, none of the methods we are considering the
construction of the distance between the strands of DNA is not a disadvantage to use it to evaluate the distances between the
neighbors as species (pairs “human – chimpanzee” and “human – bonobo” etc.), and between more distant species (pairs “human
– crocodile” and “chimpanzee– crocodile” etc.). This is because we consider first of all corners of the triangles in the Euclidean
space. However, we have made some computational experiments connected with the application for conversion metrics of
continuous monotone functions (Section 4).</p>
      <p>Brief results of computational experiments over 100 genomesare considered in Section 5. Among these results, it is worth
noting the following. Firstly, for the “distant” species badness is very small, which indicates that the right choice of our
approaches and relevant specific algorithms; in this case the fact is true for a number of different metrics. Secondly, as fo r the
“distant” species we proposed approach to the definition of the metric gives the best results (all considered “triangular”</p>
      <p>
        High-Performance Computing / B.F. Melnikov, S.V. Pivneva, M.A.Trifonov
standards), among 5 considered metrics [
        <xref ref-type="bibr" rid="ref4">4,5,6,7,10</xref>
        ]. For the “near” species (humanandapes), the results are somewhat worse
(value of badness is increased, and, besides, our version of the metric gives 2nd for the quality of the result). Third, it is unlikely
any of these metrics are appropriate for determining the distance between the subspecies: so that the application of these
algorithms to the human race sometimes arises violation of the triangle inequality. The accurate explanations of recent facts,
apparently, should lead biologists, but we also try to explain them, from our point of view.
      </p>
      <p>Some possible areas for further work already ongoing by our group at the moment are summarized in Conclusion (Section 6).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Algorithm for determining the distance between nucleotide sequences based on the multiheuristic approach</title>
      <p>As we said before, the multiheuristic approach to the problems of discrete optimization was considered [8,9] and in many
other following papers. In this section, we describe its version for determining the distance between nucleotide sequences. For
this problem, it was used as follows1. Let x, y be corresponding strings, i, j be indexes of considered symbols of strings x and y
correspondently, r be the value of metric to be found. By shifting the line, we mean increasing by 1 of the corresponding index.
The general scheme of the algorithm can be described as follows.</p>
      <p>
        The cost of matching two symbols in a simplest case equals to 1; for DNA, it can be defined using some table of amino acid
replacement costs, e.g. BLOSUM, see [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2, 11</xref>
        ].
      </p>
      <p>For this algorithms, the following heuristics were used.
1. We select such trajectories that the value (i' - i) + (j' - j) is minimum, or close to minimum. E.g. we first lookup all the
trajectories with one string shifted by one symbol; next with one string shifted by two symbols or both strings shifted by
one symbol, etc.
2. We shift a string, which current symbol found less frequent in the other string. For this heuristics it’s preferable to know
probabilities of appearance of a given symbol in each of the strings. If those probabilities are not known a priori, we
consider them being equal. While following the algorithm we can adjust those probabilities or use aging algorithm [12],
such that probability of a given symbol will be defined by some fragment of a string instead of a whole string. If
probabilities for both strings are equal, we shift a string in which more symbols are left.
3. Combination of previous heuristics (1 and 2); to calculate the position using second heuristics we sum probabilities of finding
other string for all symbols that will be passed by a shift.
4. Use of an algorithm of a longest common subsequence search for x[i..i+k] and y[j..j+k], where k ~ 15. For shift we use i', j', at
which the longest common subsequence ends. If no common subsequence found, the search range is increased. When
using this heuristics the result is close to the longest common subsequence value.
5. Combination of 3 and 4; the position (i', j') given by forth heuristics is a ratio of length of the longest common subsequence of
strings x[i..i'] and y[j..j'] to an average shift length from (i, j) to (i', j').
6. We use algorithm [13, 14] for strings x[i..i+k] and y[j..j+k], where k ~ 15, then shift to (i', j'), having the greatest value in</p>
      <p>Needleman-Wunsch table.</p>
      <p>Combination of 3 and 6; the position (i', j') given by sixth heuristics is a ratio of a value in Needleman-Wunsch table,
corresponded to that position, to average shift length from (i, j) to (i', j').</p>
    </sec>
    <sec id="sec-3">
      <title>3. The decision-making using some different greedy heuristics simultaneously</title>
      <p>Thus, we shall to use not only the heuristics, which forms branch-and-bounds method (BBM) for the considered discrete
optimization problem (DOP), but also the other one. Namely, it is heuristics for selecting element, which separate the considered
problem into right and left sub-problems. However, we can often replace a heuristics separating algorithm for some other.
Moreover, solving a DOP using BBM, it is often desirable to choose one of some separating algorithms, depend upon the solved
sub-problem. The various separating algorithms can be selected depend upon dimension of the solved problem, its bound, and
also many other descriptions, which are based on the solved DOP.
1 We have changed here the description of the algorithm given in [10]. The authors are willing to send me the source code when prompted by email.</p>
      <p>High-Performance Computing / B.F. Melnikov, S.V. Pivneva, M.A.Trifonov</p>
      <p>In classical examples of using BBM for TSP, some good separating algorithms were used (it means, that they give the
reasonable results comparing other ones). However, long before, for the BBM-branching various other heuristics were used. Let
us mention, for example, the following heuristics for the reduced TSP-matrix: total number of zeroes, sum of minimums for all
the rows and columns, sum of some minimum values of considered row and columns multiplied by special “damnuation
constants”; all these values are computed by the TSP-matrix after reducing and selecting separating element (i.e., separating
edge for branching). Probably, the author mentioned here less than 10% of the heuristics used before.</p>
      <p>Thus, how can we use the fact, that in various situations (i.e., in various sub-problems of the same considered DOP) different
heuristics relatively better are used? (This question can be putted for both exact and unfinished algorithms.) We need to make a
decision for selecting the separating element for branching. We have information of various experts, i.e., of various special
heuristics, so called predictors (or estimators). The predictors often give discrepant information, and we have to average it in a
special way. Unlike all the algorithms published before, the authors, like programming nondeterministic games, use dynamic
risk functions for this thing.</p>
      <p>Since various heuristics give values of various units, we have to normalize them for computing the final result. Using the
special set of normalizing coefficients (it is adjusted, for example, for genetic algorithms, we shall consider them below) is,
probably, a possible method, but it is not the main one for this paper. The authors used only one algorithm in various DOP; that
was a special modification of “voting method”, where each of heuristics gives the considered variants of selecting (e.g., for
traveling salesman problem, those are zeroes of the reduced matrix, corresponding some edges, which are the candidates for
branching). After that, we use special dynamic risk functions for the results of voting.</p>
      <p>Thus, selecting edge for branching is constructed for BBM by using dynamic risk functions; see previous sections for details.
It is important to note, that the dynamic selecting of the particular risk function is similar to selecting it in programming
nondeterministic games, considered in. Since we consider here DOP (not programming nondeterministic games), we have to add
here a new heuristics, i.e., one for selecting “current position estimation”, i.e., evaluation of the situation, which is obtained by
the solving some DOP using BBM.</p>
      <p>Thus, let us have some various heuristics for selecting the element for the next step of BBM (or, generally speaking, for
selecting the strategy of solution some DOP). Let each of possible strategies have some various expert evaluations of availability
(i.e., let us have some independent expert sub-algorithms, i.e., predictors). Then the concluding strategy could be chosen by
maximum of average values. However, let us consider the following example [24]; this example is connected with backgammon
programming, because it uses 36 predictors).</p>
      <p>
        Let expert evaluations of availability have values in the segment [
        <xref ref-type="bibr" rid="ref1">0,1</xref>
        ]. Let for the 1st strategy, the 1st expert has the evaluations
of availability being equal to 1, and other 35 experts have the evaluation 0.055. And for the 2nd strategy, 2 experts have the
evaluation 0.95, and other 34 experts have the evaluation 0. Very likely, each user (expert-human) on basis of these values
chooses the 2nd strategy.
      </p>
      <p>However, averaging-out by the simplest algorithm (i.e., simply the simple average of expert evaluations) gives 0.081 for the
1st case and 0.053 for the 2nd one; i.e., do we have to choose the 1st strategy?</p>
      <p>Let us make computations like to our previous papers, i.e., having the same algorithms of dynamical risk functions (DRF)
constructing. For the 1st strategy, we obtain the following risk function:</p>
      <p>–0.685 × x2 + 1.300 × x + 0.386,
and for the 2nd strategy:</p>
      <p>–0.694 × x2 + 1.374 × x + 0.321.</p>
      <p>The final values of averaging-out of expert evaluations using these risk functions are 0.111 for the 1st strategy and 0.147 for
the 2nd strategy. Therefore, using such algorithms of DRF for special averaging-out of expert evaluations gives “natural”
answers.</p>
      <p>Remark that twice repeated using of averaging-out (i.e., averaging-out using preliminary values of the first step of DRF-using)
chooses the 1st strategy again. However, in the limit we have “natural” answers again. Let us describe these results by the
following table; the names of columns are equal to the number of step of averaging-out using DRF (i.e., the number of using
algorithm constructing DRF). Then the column 0 is the simple average of expert evaluations), and the column ¥ is the limit
value.
1stt strategy
2ndsrategy
0
0.081
0.053</p>
      <p>Let us remark, that this example is really possible in solving real problems: in the real computations for the mentioned DOP,
the situations, when the difference between values of maximum and minimum expert values is more than 0.5 (i.e., more than
50% of the segment of expert values) are very often; for example, for accidental traveljng salesman problem having dimension
75 and some of predictors mentioned before, they contain, by statistics of the author, about 10%.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Some versions of “triangular” norm of quality definitions for distance metric</title>
      <p>So, there are various algorithms to determine the distances between genomes; they can be called algorithms definition of the
metric on the set of genomes2.However, this raises not only the usual questions about the adequacy of the corresponding
mathematical models (from the point of view of the authors, they are usually solved in this domain by experts in biology, [15]
etc.), but also on the comparative evaluation of these models. The most important matter in this case appears the following o ne:
can we talk about the effectiveness of such algorithms and the adequacy of these models based on only one analysis matrices
2 Mathematical aspects of the correctness of using the concept “metric” in this situation, we are expecting to discuss in a future article.</p>
      <p>High-Performance Computing / B.F. Melnikov, S.V. Pivneva, M.A.Trifonov
proximity (distance) between the genomes, without the involvement of biologists? The authors of this paper believe that this
question should be answered in the affirmative.</p>
      <p>
        For several different algorithms [
        <xref ref-type="bibr" rid="ref4">4,5,6,7,10</xref>
        ], we consider the matrices of distances between the genomes; in our
computational experiments (see. below), we used five different algorithms3 andmade corresponding distance matrices, in which
the number of genomes reached 100.
      </p>
      <p>In this case, we used the following natural philosophy(we have not found analogues in the literature);we give it for the
example ofhuman (H), chimpanzee (C) and bonobo (B).According to biologists, Sand Bdispersed (had a common ancestor),
according to various estimates, about 2–2.5 million years ago(no wonder; the alternative name of B is “pygmy C”, [16]), and
Hdispersed with both other species 5.5–7 million years ago4.In this connection, the following question arises: why Hshould be
closer to Bcomparing S? or vice versa: why it should be closer to Ccomparing B? Obviously, the answer to both these questions
is negative, i.e., by other words, the explanation of the greater intimacy cannot exist. Therefore, in the matrix of distances
between the genomes of all received triangles should ideally be acute isosceles ones.</p>
      <p>To compare the quality of algorithms for constructing the distance we have offered several versions of “waste” (so called
badness) of such “longisosceles” triangles. Apparently, when calculating the badness of the whole matrix for each option, we
should always appropriate to summarize all the badness of all possible triangles of the matrices; we make this thing in our work.</p>
      <p>So, in simple cases5, we will assume badness (norm) of the entire sum of the distance matrix, and for the badness of each
triangle will apply one of the following 4 options. (We assume everywhere, that the considered triangle has sides a, b and c,
moreover a ≥ b ≥ c;the angles are α, β and γ, moreover α ≥ β ≥ γ.)
1. (α–β) / π.
2. (α–β) / α.
3. (a–b) / a.
4. For the final norm, we consider separately “violation of an isosceles” and “violation of an acute-angled”:
 (A) 1 – min (b/a, c/b) ;
 (B) max (a– π/3, 0) / (2π/3) ;
and the general answer is (A+B) / 2 .</p>
      <p>The maximum value of badness (in each of these four cases) to a triangle may be equal to 1. At the same bad case of
algorithms for constructing metrics (i.e., when violation of the triangle inequality occurs)we believe this value from 1 to 2 (also
depending on the quantitative characteristics of the violation).</p>
      <p>As we noted above, some results of calculations are given below.</p>
    </sec>
    <sec id="sec-5">
      <title>5. The preprocessor computing as a special normalization of date</title>
      <p>Results and discussion may be presented in separate sections or combined into a single section, whichever format conveys the
results in the most lucid fashion. The authors should discuss the significance of observations, measurements, or computations
and should also point out how these contribute to the aims indicated in the Introduction. Tables, Figures, and Figure Captions
should be embedded within the Section.</p>
      <p>In this section we consider another heuristic, which can be considered optional for all heuristics of “violation of an isosceles
and of an acute-angled ”considered before. For this thing, we consider a function of the type f(x)=xα, where the value α (usually,
0&lt;α&lt;1) is chosen for each matrix of distances (see below some more about selecting α).Where each of the x of distance matrix is
replaced by f(x).</p>
      <p>To select specific values of α, improving, from our point of view, the quality of the choice of metrics, we applied the following
considerations. Below, considering the description of the results of computational experiments, we will show, that various
heuristics select metrics are relatively different priorities for genomes for“distant” and “close” species; and it is worth noting
that such a priority varies little with his study at different rates described above. Attempts to improve the value of these norms
(badness’) applying some functions of the type f(x)=xα are unsuccessful: solutions of the corresponding minimization problems
given either the maximum or minimum valueα (among all the possible ones); it is easy to understand, that in this case, we obtain
the matrix of distances between genomes triangles “are closest to the acute-angled isosceles”. There fore, if we really try to
improve quality metrics, it is necessary to use a fundamentally different heuristics. For this thing, we were trying to find a
function of the above type in which the set of values of the distance matrix, viewed as a distribution of a random variable, is
obtained as close as possible to a uniform distribution6; in advance, we note that for different tasks (i.e., for different concrete
matrices of distances),the values of α, obtained by pseudo-optimal real-time algorithms (which are realized by algorithms of
[8,9,14] etc.) are different.</p>
      <p>In this case, we have chosen the goal function on the basis of the entropy maximization ([17] and many others). Specific
outcomes associated with the use of this heuristic are given below.
3
4
5
6</p>
      <p>Especially note again, that among these algorithms is one of our, the original one.</p>
      <p>It is important to note that the exact values of time such models are not important!
We note in advance that we will consider a somewhat more complex option.</p>
      <p>Informally that can explain, for example, as follows.We already know, that in our model,genomesof human (H), bonobo (B) and crocodile (C) form a
"stretched" acute-angled triangle, which is close to an isosceles one. In this case, the exact values of the lengthsH–C and B–C unlikely to be of interest;it is only
important, that they are approximately equal. Also unlikely,that the value ratio of the length of H–Bto the length ofH–Cis to be of interest.
Felis catus
0,625668764
0,602974896
0,606326063
0,324592863
1
0,451408078
0,475042624
0,335116703
0,563642777</p>
    </sec>
    <sec id="sec-6">
      <title>7. Some final results of calculations</title>
      <sec id="sec-6-1">
        <title>Below, we shall call:</title>
        <p> our original algorithm for constructing a metric between genomesby the first one (belowNo.1,see [10]);
 one of algorithms of M. van der Loo etc. (below No. 2, see [5], used functionis jarowinkler()) by the second one;
 another algorithm of M. van der Loo etc. (below No. 3, see again [5], used function is stringdist()) by the third one;
 one of algorithms of H. Pages etc. (below No. 4, see [6], used function is stringDist()) by the fourth one;
 another algorithm of H. Pages etc. (below No. 5, see [6], used function is pairwiseAlignment()) by the fifth one.</p>
        <p>Let us denote, that algorithms No. 4 and No. 5 are “non-symmetrical” ones, and, when filling in the distance matrix, we used
half-sums of the two obtained values. Also let us note that the violations of the triangle inequality were recorded only as a result
of the algorithms No. 4 and No. 5; however, in the case of “distant” species, we had a few such results: approximately,1 case per
2000 examined potential triangles.</p>
        <p>For further counting, we firstly have randomly chosen genomes of 100 representatives of the species, given in [18] (the case
of considering“ distant species)7.Some results of computations(the table 100х100, i.e., 100·99/2=4950 values, making
(100·99·98) / (2·3) = 161700 triangles) are given below in Table 1, where:
the rows are number of the algorithms (as we write before);
the columns: approximate time of the creation of the distance matrix (for making all the 4950 values, CPU clock speedis
approximately 2 GHz); number of violations of the triangle inequality (in average 1000 launches for triangles); middle badness,
countedfor all the algorithms 1–4 countingbadness for each triangle, see Section 3.</p>
        <p>All values badness we give up to 3 decimal places (the time of algorithms for constructing matrices was recorded some less
accurately).In all tables, we celebrated the best metric for the considered norm (it is singled twice) and also the 2nd place (it is
singled once).</p>
        <p>As we can see, the algorithm implemented by our group, the majority of rules is optimal. And there is very important to
remark (it follows from the above), that heuristics that were used to create this algorithm had absolutely no connection with the
heuristics used to describe the “triangle rules” defined below.</p>
        <p>Secondly (the case of considering “near” species), we also randomly have chosen genomes of humanand5 apes (bonobo,
chimpanzee, gorilla, orangutan, gibbon), which are also given in [18]. In this case, each species taking 4 -5 representatives (of 28
genomes);for the human, we took the genomes of different races. Some results of computational experiments are given in Table.
2, where, unlike Table 1, we failed build time. Furthermore, due to the small total number of triangles (less than 5000),we have
brought the number of violations of the triangle inequality (rather than relative values of this quantity).
7</p>
        <p>All lists of specific species corresponding genomes taken primarily from the site [18].The authors are ready to send the obtained values of distance matrices,
as well as the source code, by e-mail(at your request). We are also ready to send the detailed results of calculations of the badness, including not only the
averaged, but all produced in the process value.</p>
        <p>High-Performance Computing / B.F. Melnikov, S.V. Pivneva, M.A.Trifonov
Table 6. “Near” species.</p>
        <p>No. violations
badness-1, badness-2, badness-3, badness-4,
(α–β) / π (α–β) / α (a–b) / a (A+B) / 2
1 0 0,0757 0,152 0,0645 0,364
2 1 0,0333 0,0687 0,0302 0,215
3 1 0,514 0,622 0,170 0,582
4 32 0,0595 0,122 0,0496 0,341
5 39 0,0741 0,151 0,0615 0,350</p>
        <p>As we can see, the relative number of violations of the triangle inequality significantly increases. In addition, our original
distance metric between genomes is now not optimal.</p>
        <p>Thirdly, we used “preprocessor” algorithms as previously described. It should be noted that the application of these auxiliary
algorithms decreased value of bad nessin almost all cells; however, it was not the goal of this algorithm. Besides, “leaders little
changed”, i.e., our algorithm for constructing the metric (string 1) shows better results (comparing the absence of these auxiliary
algorithms); however, the latter fact is just and can be explained by “tuning” the algorithm 1 for its use for a greater range of
values. The results of computational experiments are given belowin Table 3.</p>
        <p>violations badness-1, badness-2, badness-3, badness-4,</p>
        <p>(α–β) / π (α–β) / α (a–b) / a (A+B) / 2
1 17 0,140 0,243 0,0924 0,325
2 29 0,119 0,173 0,0359 0,342
3 30 0,420 0,527 0,187 0,493
4 30 0,119 0,218 0,0880 0,313
5 26 0,129 0,229 0,0881 0,306</p>
        <p>We could make a lot of comments to the values listed in this table; let us consider only the main one. A relatively large
number of violations of the triangle inequality (and consequently, significantly larger values of badness, at its counting on any of
the rules),is apparently due to the large number of people crossing concrete already after the separation of races. I.e., apparently,
these algorithms should not be used to the genomes of subspecies (without their further modifications).</p>
        <p>However, despite this fact, we are going to publish some further improvement of our original algorithm for constructing
metrics, as well as our approach to the description of the badness. Besides, different algorithms for constructing metrics may be
more appropriate with respect to different situations.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>8. Conclusion. Some future ways for research</title>
      <p>In this section, we look briefly at some algorithms that are going to be published in subsequent papers.</p>
      <p>As a possible connection between the two approaches to solving problems biocybernetic sand the traveling salesman problem
(at first, so calledits pseudo-geometrical version, see [8,19] etc.) we may call not only the above-mentioned multi-heuristic
approach to the problems of discrete optimization, butalsoso calledalgorithms forpseudo-placing dotsin k-dimensional Euclidean
space [19]. These auxiliary algorithms improve the performance of other our algorithms. Moreover, algorithms similar to ones
used by us in [13,14] could also be considered as such auxiliary algorithms; they some improve algorithms d escribed in this
article. Described in these articles use the risk functions of this direction is also, and we apply the special applications of the
well-known“3 sigma rule”.</p>
      <p>Besides, one of the most frequently discussed in biocybernetics problems is that the recovery of the distance matrix, when
weknow a part of the completed element sonly [11, 20].Using the same “triangular norm”, we propose an original algorithm for
such recovery; we are going to write about it in the near future.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <sec id="sec-8-1">
        <title>The reported study was funded by RFBR according to the research project № 16-47-630829.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Gusfield</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Algorithms</surname>
          </string-name>
          on Strings,
          <source>Trees and Sequences: Computer Science and Computational Biology</source>
          . Cambridge,
          <year>1997</year>
          ; 631 p.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Toppi</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Vico Fallani</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petti</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veссhiato</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maglione</surname>
            <given-names>AG</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cincotti</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salinari</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mattia</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Babiloni</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Astolfi L</surname>
          </string-name>
          .
          <article-title>A new statistical approach for the extraction of adjacency matrix from effective connectivity networks</article-title>
          .
          <source>IEEE Engineering in Medicine and Biology Society (EMBC)</source>
          <year>2013</year>
          ;
          <fpage>3</fpage>
          -
          <lpage>7</lpage>
          :
          <fpage>2932</fpage>
          -
          <lpage>2935</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Torshin</given-names>
            <surname>IYu</surname>
          </string-name>
          .
          <article-title>Bioinformatics in the Post-Genomic Era: The Role of Biophysics. Nova Biomedical Books</article-title>
          , NY,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Winkler</surname>
            <given-names>WE</given-names>
          </string-name>
          .
          <article-title>String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage</article-title>
          .
          <source>Proceedings of the Survey</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>