Comparison of the Alignment and Shaidurov’s Methods the
Search Efficiency in Symbol Sequences with Mismatches
Anna Molyavko1, Evgenia Karepova2, Mikhail Sadovsky2,4, Igor Borovikov3 and
Olga Mutovina1
1
 Siberian federal university, Krasnoyarsk, Svobodny pr., 79, 660041, Russia
2
 Institute of Computational Modelling of the Siberian Branch of the Russian Academy of Sciences, 50/44
Akademgorodok, Krasnoyarsk, 660036, Russia
3
 Nekkar.net LLC, Foster City, CA 94404
4
 Federal Research & Clinic Center of FMBA of Russia, Kolomenskaya str., 26, Krasnoyarsk, 660037, Russia


                Abstract
                The new method for comparison and analysis of symbol sequences is proposed; the method is
                based on the convolution function calculation defined over the binary numeric sequences
                derived from the original symbol sequence. The method provides highly parallel
                implementation and is very powerful in insertion/deletion mutations search. A discrete fast
                Fourier transform is implemented for convolution calculation. Also, an idea of the alphabet
                expansion is proposed to improve the signal/noise ratio. Some genomic applications are
                provided and discussed. The applications are used to illustrate and overcome the problem of
                signal/noise selection, and alignment localization.

                Keywords 1
                Pattern recognition, anomaly detection, parallel computation, InDel-genome comparison,
                knowledge retrieval

1. Introduction
   Comparison of symbol sequences stands behind the methodology in many fields of science,
ranging from mathematics to bioinformatics and linguistics. There are two versions of the problem:
the former is an exact matching search, and the latter is a homology search with mismatches; it is
worth noting that the mismatches should not destroy the sense of comparison. The first version has a
rigorous solution for a long time; however, new algorithms appear effective, e. g. for very long
sequences or many entities under comparison [7, 8].
   The second version is much more complicated since it requires a rigorous and unambiguous
description of admissible mismatches and their metrization. The latter is a problem itself; currently,
the alignment methodology based on the Levenshtein distance [3–5] is the most widely spread
approach here. This methodology has several disadvantages, and the most crucial among them is the
arbitrariness in choosing the score function, resulting in the appearance of various free parameters,
making the alignment an art rather than a science; the alignment divergence since the search
procedure never stops, if no special efforts are made; and finally the insertion and deletion
mismatches bringing the most problematic trouble for the alignment.
   V.V. Shaidurov in [6] proposed a new method of the common subsequence search which is very
easy using insertions and deletions (further referred to as the Shaidurov’s method). It is an alignment-
free method, which is free from any adjustable parameters heavily affecting the final result of the


SibDATA 2021: The 2nd Siberian Scientific Workshop on Data Analysis Technologies with Applications 2021, June 25, 2021,
Krasnoyarsk, Russia
EMAIL: annamo@icm.krasn.ru (A. Molyavko); e.d.karepova@icm.krasn.ru (E. Karepova); msad@icm.krasn.ru (M. Sadovsky)
ORCID: 0000−0003−1016−0131 (A. Molyavko); 0000−0002−6515−2932 (E. Karepova); 0000−0002−1807−0715 (M. Sadovsky);
0000−0002−0861−0001 (I. Borovikov); 0000−0001−9264−859X (O. Mutovina)
           © 2021 Copyright for this paper by its authors.
           Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
           CEUR Workshop Proceedings (CEUR-WS.org)
comparison. This paper shows the feasibility of the method with random four-letter symbol sequences
generated by the Bernoulli process.
   The method is based on calculating the convolution of two polynomials; each polynomial
represents the corresponding symbol sequence. The algorithm of the method results in two sequences
from the same alphabet ℵ{A, C, G, T} of the capacity K = |ℵ| = 4 of the length N1 and N2,
correspondingly. The algorithm comprises four steps. Let us consider the former for the comparing
two sequences T1 and T2; |T1| = N1, |T2| = N2.
   Preprocessing First of all, the sequence (T2, for certainty) must be inverted: b1, b2, . . . , bN2 ⇒
bN2 , bN2 −1, . . . , b1. Then, both sequences must be converted into K binary ones, K = |ℵ| so that each
of K binary sequences corresponds to a peculiar symbol from ℵ. For example, the binary sequence for G
has unity instead of G and zero instead of all the other symbols. Finally, each of 4 × 4 = 8 binary
sequences must be extended to 𝐿 which is the nearest upper power of 2 of the value |T1| + |T2| − 1; this
expansion is required for the fast Fourier transform (FFT) implementation;
     Processing Then, each of the 2K binary sequences is transformed through the calculation of
t h e Fast Fourier transformation, thus changing t h e binary (0, 1) - sequence into a complex one;
     Convolution Each couple of the sequences corresponding to the same symbol must be multiplied,
element by element, and the sum of the products obtained for all the symbols must be calculated, and
     Inverse FFT The sum obtained at the previous step must be processed using inverse FFT which
yields a real number sequence.
     Each element of this real (positive) number sequence derived due to inverse FFT indicates the
number of the coinciding symbols (within the overlapping pattern of these two sequences) regardless
of the particular location of the coinciding symbols. This paper aims at illustrating the highly efficient
feasibility of Shaidurov’s method for the problem of pattern search for considering the biological
matter. The patterns to be found are relatively short (70 to 600 symbols long) subsequences of high
biological value, the so-called transposons. We search for these patterns in chloroplast genomes of
various plants, including Hymnosperms and flowering plants; the point is that some researchers
believe that there are no transposons in chloroplasts [9, 1].


Figure 1: An example of a fuzzy coincidence found using the Shaidurov’s method but with a
CENSOR failure in the search
2. Results

    The alphabet ℵ = {A, C, G, T} corresponds to genetic entities. All the pattern subsequences were
retrieved from the Repbase deposit. We compared the Shaidurov’s method with the standard alignment
provided by the CENSOR software [2]; two parameter sets were used. Both sets are implemented in the
software and support either “rigid” or “flexible” search. The Shaidurov’s method is free from this
arbitrariness.
    CENSOR found a transposon of Copia-18 BD-I type in 181 genomes, and a transposon of MuDR-64
OS type in 44 genomes. Reciprocally, the Shaidurov’s method revealed 323 and 213 entries,
respectively, and these sets include those provided by CENSOR. It is evident that the Shaidurov’s
method is more effective than the alignment, since it identified a number of entries omitted by
CENSOR; Fig. 1 illustrates this point. Here, the upper sequence corresponds to the genome, and the
lower one corresponds to the transposon.
    The advantages provided by the Shaidurov’s method are shown in Figure 2(a) and 2(b). Figure 2(a)
shows the Shaidurov’s method output for the Copia-18 BD-I (length is 319 symbols) search with respect
to point mutations with no insertions or deletions. Figure 2(b) shows similar results with the
insertion/deletion incorporation.


                                                    a)


                                              b)
Figure 2: An example of a fuzzy coincidence found using the Shaidurov’s method but with a
CENSOR failure in the search

   The X axis shows the number of genomes with the pattern found in them and falling into the
corresponding interval. The convolution calculation supports the identification of the transposon
embedding both with and without insertions or deletions. The identification of the transposon pattern
without insertions and/or deletions allows one to determine the exact number of perfectly matching
symbols equal to the transposon length. The inclusion of insertions or deletions into the pattern search
worsens the estimation of the exactly matching symbols, and the former becomes a probabilistic one.
The estimation goes down as the number of insertions/deletions grows up. Thus, the X axis in Fig.
2(b) shows the intervals of estimating the number of coincidences instead of the exact interval values.
Fig. 2(a) unambiguously shows that Copia-18 BD-I is sure to be found in two genomes, since there
are 316 coincidences among 319 ones. On the other hand, this figure shows that the greatest number
of genomes (134 entities) has 305 to 309 coincidences among 319.
    The results presented above unambiguously prove the presence of transposons in chloroplast
genomes; however, their abundance and diversity is significantly lower in comparison with other
genetic systems. Here, we are not able to make an immediate comparison of the alignment efficiency
as opposed to the Shaidurov’s method since the CENSOR output yields the so-called score (which is a
numeric evaluation of the “quality” of the alignment), which is inspired by users from the view point
of the rule for its derivation.
    The results of this paper prove the high efficiency of the Shaidurov’s method in the analysis of the
biologically inspired comparison of symbol sequences and pattern search including insertions and
deletions. Apparently, the greatest advantage of the Shaidurov’s method is the complete absence of
hidden or free parameters to be used to adjust the comparison.
    Two problems make an essential obstacle in broader applications of the method. The former is
signal/noise ratio improvement, and the latter is the localization of the sites of interest. Surprisingly,
both problems could be significantly addressed with the expansion of the alphabet. Initially, we used a
single letter alphabet to derive the binary sequences (that is, four letters, in the case of nucleotide
sequences). However, one can assign the duplets, triplets, etc., as the letters of the new extended
alphabet, doing the same job. Of course, it yields an exponential growth of the number of binary
sequences to be processed. Meanwhile, they could be processed in parallel, so it is not a problem.
Since the convolution value represents the number of exactly matching symbols throughout the
overlap of two sequences regardless of the exact location of the matched symbols, then the expansion
of the alphabet will result in a significant decrease of the number of exactly matching k-tipples, thus
providing the better signal/noise ratio and the localization of the longer highly similar subsequences in
the compared entities.

3. Conclusion
    The paper presents a new method for comparison and analysis of symbol sequences. The method
is based on the convolution function calculation defined over the binary numeric sequences derived
from the original symbol sequence. The method provides highly parallel implementation and is very
powerful in insertion or deletion mutations search. A discrete fast Fourier transform is implemented
for convolution calculation. Also, an idea of the alphabet expansion is proposed to improve the
signal/noise ratio. Some genomic applications are provided and discussed. The applications are used
to illustrate and overcome the problem of signal/noise selection, and alignment localization.

4. Acknowledgements
   This work is supported by the Krasnoyarsk Mathematical Center and financed by the Ministry of
Science and Higher Education of the Russian Federation in the framework of the establishment and
development of regional Centers for Mathematics Research and Education (Agreement No. 075-02-
2021-1384).

5. References
[1] J. L. Bennetzen, Transposable element contributions to plant gene and genome evolution, Plant
    molecular biology 42(1) (2000) 251–269.
[2] J. Jurka, P. Klonowski, V. Dagman, P. Pelton, Censor – a program for identification and
    elimination of repetitive elements from DNA sequences, Computers & chemistry 20(1)
    (1996) 119–121.
[3] V.I. Levenshtein, On perfect codes in deletion and insertion metric, Discrete Mathematics and
    Applications 2(3) (1992) 241–258.
[4] V.I. Levenshtein, Efficient reconstruction of sequences from their subsequences or
    supersequences, Journal of Combinatorial Theory, Series A 93(2) (2001) 310–332.
[5] V.I. Levenshtein, Bounds for deletion/insertion correcting codes, in: Proceedings IEEE
    International Symposium on Information Theory., 2002, p. 370.
[6] A. Molyavko, V. Shaidurov, E. Karepova, M. Sadovsky: Highly parallel convolution method to
    compare DNA sequences with enforced in/del and mutation tolerance, in: International Work-
    Conference on Bioinformatics and Biomedical Engineering, Springer, 2020, pp. 472–481.
[7] S. P. Tsarev, M. G. Sadovsky, New error tolerant method for search of long repeats in DNA
    sequences, in: International Conference on Algorithms for Computational Biology, Springer,
    2016, pp. 171–182.
[8] S. P. Tsarev, M. Y. Senashova, M. G. Sadovsky, Fast algorithm for vernier search of long repeats
    in dna sequences with bounded error density, in: International Conference on Algorithms for
    Computational Biology, Springer, 2018, pp. 88–99.
[9] T. Wicker, H. Gundlach, M. Spannagl, C. Uauy, P. Borrill, R.H. Ram´ırez-Gonza´lez, R. De
    Oliveira, K.F. Mayer, E. Paux, F. Choulet, Impact of transposable elements on genome structure
    and evolution in bread wheat. Genome biology 19(1) (2018) 1–18.