<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Comparative Analysis of the Tsarev Combined Algorithm</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikhail Sadovsky</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmitry Dokuchaev</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Computational Modelling of the Siberian Branch of the Russian Academy of Sciences</institution>
          ,
          <addr-line>50/44 Akademgorodok, Krasnoyarsk, 660036</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Siberian Federal University</institution>
          ,
          <addr-line>79 Svobodny pr., Krasnoyarsk, 660041</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>25</volume>
      <issue>2021</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This paper presents the combined Tsarev algorithm and its comparison with the Brute Force algorithm in terms of time performance and efficiency for problems of searching for the longest pattern in sequences. In the present study experiments were made with different types of DNA sequences such as: Candida Albicans, Candida Dublinesis and Coronavirus.</p>
      </abstract>
      <kwd-group>
        <kwd>1 Pattern recognition</kwd>
        <kwd>combined Tsarev algorithm</kwd>
        <kwd>Brute Force algorithm</kwd>
        <kwd>comparative analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Tsarev algorithm for a common subsequence search</title>
      <p>Let us consider the problem of the longest pattern search using the combined Tsarev algorithm.
Let  and  be two symbol sequences from a finite alphabet. In our cases, the alphabet comprises
four symbols A, C, G, T. Find all comparatively long common subsequences occurred in them.
“Comparatively long” means here that the length of a common subsequence (if any) is no smaller than
max( ,  ) /20.</p>
      <p>Let k be the length of the window for searching and L be the expected length of the pattern.
Following the Tsarev algorithm, let us introduce the constraint =  (√ ). Using the window let us
split the sequence  into the frequency dictionaries with the length k with the step k, and split the
sequence  into the dictionaries with the length k with the step k-1, respectively. One obtains two
sets of frequency dictionaries for each of the presented subsequences at the input. Assuming that there
exists a common subsequence with the length, L then, in the frequency dictionaries with the length k
and step k and the length k and step k-1 there must be at least one coinciding word.</p>
      <p>Let us make a comparison of the two sets of the frequency dictionaries following the principle
“each to each” until we find a coincidence. Consider the case of the exact coincidence of the words
with the length k. If the respective coincidence is found, we expand two words to the left and then, to
the right, respectively, until the elements of the sequences 
and 
coincide. The obtained
expansion will be the sought longest pattern.</p>
      <p>Due to the fact that in many applied problems the value L is not known in advance, it becomes
necessary to estimate it. This can be done by changing the value of the parameter k. Namely, if none
of the words of the length k matches, one needs to decrease the parameter k. Decreasing the parameter
will automatically decrease the expected value L. By repeating the search procedure, one can find the
new longest pattern. The main purpose of this study is to describe a method for the suboptimal choice
of the decrease character of the parameter L.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Choice and change of the parameters k and L</title>
      <p>In the present study experiments were made with different types of DNA sequences: Candida
Albicans, Candida Dublinesis, Coronavirus. These sets have DNA sequences of different length,
which is manifested in the above introduced parameters of the Tsarev algorithm. The Coronavirus set
was chosen for comparative analysis. This is due to the length of DNA sequences in this set being
approximately equal to 30,000 symbols, which allows applying various algorithms to this set,
including ineffective ones, such as Brute Force. This sequence set consists of 67 sequence copies.</p>
      <p>A peculiarity of the Brute Force algorithm is that it does not require the introduction of additional
parameters. The algorithm is based on the complete brute force search of all possible combinations of
subsequences and their direct comparison. Obviously, this approach is ineffective and very costly in
terms of the resources of the machine on which the calculations are performed. This algorithm can be
applied to a set of Coronavirus sequences without huge time delays and one can obtain an accurate
result in finding the longest subsequence. By applying this algorithm to a set of sequences, it was
possible to obtain the longest accurate recurrent patterns, as well as to estimate the operating time
indicators of the Brute Force algorithm.</p>
      <p>The main problem here is the problem of choosing an efficient way to change the parameter k for
implementing the Tsarev combined algorithm. In the case of Coronavirus genome analysis the value L
was unknown. Therefore, at the first step the values 
= 10 were chosen, which was approximately
equal to max( , 
) /3. Then, the window length was 
= 100. The chosen value L turned out to
be too high: in fact, there are no identical subsequences of such a length in these genomes. A number
of experiments was made using different ways of changing the parameter k, for example, 
=
( ⁄√2), 
= 
(0.8 ∗  ),

=</p>
      <p>(0.9 ∗  ). Having made the choice, we performed
calculations according to the Tsarev algorithm described above.</p>
      <p>During the experiments, the choice of the initial value 
and method of changing the parameter k
was found to directly affect the search quality and the running time of the Tsarev algorithm. An
incorrect choice of the parameter k (a too large value) can give a false effect of the Tsarev algorithm:
if k turns out to be comparable with the real length of the longest common subsequence, then the
coincidence of such words (of the k length) can occur at the very first step of the Tsarev algorithm,
and there is no further expansion (contrary to the theoretical prediction). False triggering of the Tsarev
algorithm can be misleading. To test the performance of the combined algorithm, a sufficiently long</p>
      <p>The paper analyzes the combined Tsarev algorithm in comparison with the Brute Force algorithm.
Experiments were carried out to evaluate the performance and efficiency of this algorithm, and its
dependence on the input data. Particular attention is paid to the selection of parameters for the
subsequence was extrapolated from 
of the algorithm operates correctly.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>into  . The latter evidences that the software implementation
combined Tsarev algorithm in order to improve the performance of the algorithm taking into account
the above criteria.</p>
    </sec>
    <sec id="sec-5">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Apostolico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Giancarlo</surname>
          </string-name>
          ,
          <article-title>The Boyer-Moore-Galil string searching strategies revisited</article-title>
          ,
          <source>SIAM Journal on Computing</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ) (
          <year>1986</year>
          )
          <fpage>95</fpage>
          -
          <lpage>105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Boyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <article-title>A fast string searching algorithm</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>20</volume>
          (
          <issue>10</issue>
          ) (
          <year>1977</year>
          )
          <fpage>762</fpage>
          -
          <lpage>772</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Guibas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Odlyzko</surname>
          </string-name>
          ,
          <article-title>A new proof of the linearity of the Biter-Moore string searching algorithm</article-title>
          ,
          <source>SIAM Journal on Computing</source>
          .
          <volume>9</volume>
          (
          <issue>4</issue>
          ) (
          <year>1980</year>
          )
          <fpage>672</fpage>
          -
          <lpage>682</lpage>
          . doi:
          <volume>10</volume>
          .1109/SFCS.
          <year>1977</year>
          .
          <volume>3</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Experiments with a very fast substring search algorithm</article-title>
          ,
          <source>Software: Practice and Experience</source>
          <volume>21</volume>
          (
          <issue>10</issue>
          ) (
          <year>1991</year>
          )
          <fpage>1065</fpage>
          -
          <lpage>1074</lpage>
          . doi:
          <volume>10</volume>
          .1002/spe.4380211006.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Tsarev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Sadovsky</surname>
          </string-name>
          ,
          <article-title>New error tolerant method for search of long repeats in DNA sequences</article-title>
          .
          <source>in: Proceedings of International Conference on Algorithm for Computational Biology</source>
          ,
          <source>AlCoB' 2016</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>182</lpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>319</fpage>
          -38827-4_
          <fpage>14</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>