108 Convolution and Fast Fourier Transform to Compare Symbol Sequences* Anna Molyavko1[0000–0003–1016–0131], Evgenia Karepova2[0000–0002–6515–2932], Mikhail Sadovsky2[0000–0002–1807–0715], Vladimir Shaidurov2[0000–0002–7883–5804] and Igor Borovikov3[0000–0002–0861–0001] 1 Siberian Federal University, 79 Svobodny st., Krasnoyarsk, 660041, Russia 2 Institute of computational modelling of the Siberian Branch of the Russian Academy of Sciences, 50/44 Akademgorodok, Krasnoyarsk, 660036, Russia 3 Electronic Arts, Redwood City, California, USA {msad, e.d.karepova}@icm.krasn.ru Abstract. We propose a new method to compare and analyze symbol sequences based on the convolution function calculation, where the latter is defined over the binary numeric sequences obtained by a specific transformation of the original symbol sequence. The method allows highly parallel implementation and it is of great value for the insertion/deletion mutations search. To calculate the convolution function, a fast discrete Fourier transform is implemented. Some genomic applications are provided and discussed. The applications are used to illustrate and overcome the problem of signal/noise selection, and alignment localization. Keywords: Pattern Recognition, Anomaly Detection, Parallel Computation, InDel, Genome Comparison, Knowledge Retrieval. 1 Introduction Here, we present a new method to compare and/or search for common subsequences in symbol sequences based on convolution. Additionally, fast Fourier transformation is used to compute the convolution, as well as a special representation of the symbol sequences under consideration. We illustrate it with a few biologically inspired examples. Also, the studies reveal some difficulties in the method implementation, signal/noise extraction and localization of the coinciding subsequences being the most acute among them. Currently, alignment is the most popular method to compare two (or several) sequences, either exact matching, or with some errors. In spite of tremendous investigations both in hardware and software for alignment, this method is still very complicated and has a number of drawbacks which are impossible to eliminate. The worst of them are divergence, arbitrariness in the fine function determination, and * Copyright c 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 109 very low efficiency for InDel mismatches search (see, e. g. [1, 2, 8, 7] for details). The proposed method is free from these crucial disadvantages. Moreover, it is highly potential for coarse grained parallelism in various implementations. In brief, the method implies the following steps. Preprocessing. Each symbol sequence under consideration must be converted into 𝐾 binary ones, where 𝐾 = |ℵ| is the capacity of the alphabet. Next, one sequence (𝔗 , for certainty) must be inverted. Then, expand each binary sequence to the length 𝐿 = 𝑁 + 𝑁 − 1, adding zeros (upright, for certainty). Finally, expand each binary sequence of the lenth 𝐿 to the nearest upper power of 2, to proceed to a fast Fourier transform (FFT). Processing. Calculation of Fourier images of those binary sequences. Postprocessing. Term by term multiplication of the images, thus obtaining a new sequence. Result. Calculation of the inverse Fourier transform to obtain a convolution. 2 Method Description 2.1 Basic Concepts Convolution S = A ∗ B of two number sequences A = {𝑎 } and B = {𝑏 } is the sequence S{𝑠 } with ∑ 𝑎 𝑏 ,𝑖 < 𝐿 , 𝑠 = ∑ 𝑎 𝑏 ,𝐿 ≤ 𝑖 < 𝑁 , (1) ∑ 𝑎 𝑏 ,𝑁 ≤ 𝑖 . The convolution is a key tool to compare two sequences, to seek for a specific subsequence within the latter. To do the search, we must introduce the convolution S = A ∗ B of the inversed sequence B = {𝑏 } . Following (1), one obtains 𝑠 = 𝑎 𝑏 ,𝑠 =𝑎 𝑏 +𝑎 𝑏 ,𝑠 =𝑎 𝑏 +𝑎 𝑏 + 𝑎 𝑏 , and so on. The convolution is a key tool to compare two sequences, to seek for a specific subsequence within the latter. To do the search, we must introduce the convolution S = A ∗ B of the inversed sequence B = {𝑏 } . Following (1), one obtains 𝑠 = 𝑎 𝑏 ,𝑠 =𝑎 𝑏 +𝑎 𝑏 ,𝑠 =𝑎 𝑏 +𝑎 𝑏 + 𝑎 𝑏 , and so on. A brute force way to calculate the convolution of two sequences is rather hard. To overcome this problem, we consider the convolution as a product of two polynomials (of the power 𝐿 − 1 and 𝑁 − 1, respectively). In other words, we consider two number sequences A and B as sets of coefficients of the corresponding polynomials. Thus, the convolution is converted to a product of two polynomials. The next step comes from the well-known theorem stating that Fourier transform of a convolution is a product of Fourier transforms of the convolution of functions (sequences, in our case) under consideration. Hence, an idea is to apply (fast) Fourier transform to both sequences, multiply the Fourier images, and then to apply the inverse Fourier transform to obtain the convolution of the original sequences. Fourier transform is, in turn, the convolution. Meanwhile, there is a specific algorithm of a 110 very fast calculation of Fourier image of any number sequence called fast Fourier transform (FFT). Let 𝔽 denote FFT; it transforms a number sequence A into a sequence A′ = 𝔽(A) of the same length 𝑁 − 1. Let 𝔽 (A′) = A denote the inverse FFT. Let now introduce the operation X • Y for two number sequences X = {𝑥 } and Y = {𝑦 } of the same length: X • Y = {𝑥 𝑦 } (2) 2.2 Algorithm Description Consider two finite symbol sequences P = {𝑝 } and Q = {𝑞 } from the alphabet ℵ = {A, C, G, T}. The algorithm comprises the following steps. 1. Inverse the sequence Q, yielding Q = {𝑞 } . 2. Change P and Q into |ℵ| (that is 4, in our case) binary sequences, provided by the following: ─ (0,1) sequence P is obtained by the substitution of all the symbols A in P with 1, while all the others are changed for 0; ─ (0,1) sequence P is obtained by the substitution of all the symbols C in P with 1, while all the others are changed for 0; ─ (0,1) sequence P is obtained by the substitution of all the symbols G in P with 1, while all the others are changed for 0; finally ─ (0,1) sequence P is obtained by the substitution of all the symbols T in P with 1, while all the others are changed for 0. Similarly, Q = {𝑞 } must be changed for Q , Q , Q and Q . It should be kept in mind that here Q sequence must be used. 3. Expand the sequences with zeros for further application of FFT to obtain a sequence of the length 𝑁 + 𝐿 − 1. To do this, all 2 × |ℵ| binary sequences must be accomplished with zeros (upright, for certainty) to that length. Additionally, an effective implementation of FFT requires the sequence to be as long as the power of 2, so we must add zeros to obtain the length 𝑁 = 2⌈ ( )⌉ . 4. Apply FFT to each of the binary sequences: P ′ = 𝔽(P ) , P ′ = 𝔽(P ) , P ′ = 𝔽(P ) , P ′ = 𝔽(P ) , Q ′ = 𝔽(Q ) , Q ′ = 𝔽(Q ) , Q ′ = 𝔽(Q ) , Q ′ = 𝔽(Q ) . 5. Following (2), multiply the relevant Q ′ (here 𝜈 runs A, C, G and T) and sum them up: S′ = P ′ • Q ′ + P ′ • Q ′ + P ′ • Q ′ + P ′ • Q ′ . 6. Apply the inverse FFT to S′ to obtaining the convolution S = 𝔽 (S′). 111 3 Results We illustrate the method efficiency by applicaton to the search of transposons in plant mitochondrial genomes. We use the well-established results of the search provided by the Censor software which was carried out at the Siberian Federal University as a diploma project [3]. Transposon is defined as a chromosomal segment which can undergo transposition, especially in a bacterial DNA that can be translocated as a whole between the chromosomal, phage, and plasmid DNA in the absence of a complementary sequence in the host DNA. The typical length of the transposon ranges from 30 to 500 nucleotides. Meanwhile, the typical length of a mitochondrial DNA of a plant is about ∼ 10 nucleotides. Fig. 1(a) shows the total pattern, for the entire chromosome, and Fig. 1(b) shows the detailed site of the exact matching transposon. InDel detection with the convolution comparison technique is shown in Fig. 1(e) (the total pattern, for the entire chromosome), and in Fig. 1(f) (the detailed site of the InDel mismatch). Figs 1(c) and 1(d) show the total (left) and the detailed (right) patterns for the case of 18 point mismatches. Fig. 1. Detection of exact matching with the convolution comparison technique. 112 Since the convolution yields the sequence 𝑁 + 𝑁 − 1 symbols long, then two essential problems arise: the former is localization, and the latter is signal/noise distinction [6]. Let us explain the first problem in few detail. Suppose, there is a common subsequence of the length 𝐿 ≈ √𝑁, where 𝑁 = min{𝑁 , 𝑁 }. Then, the peak indicating the exact (or highly likely) coincidence of two copies of the subsequence appears somewhere at the convolution. The point is that the exact location of this highly scored coincidence could be located within the resulting convolution with accuracy of ±𝐿, depending on the location of a copy of the subsequence in each sequence under comparison. To avoid this discrepancy, we illustrate the efficiency of the method with the search of very small (in terms of the localization problem) template subsequences that are the transposons. Similarly, the noise-to-signal problem is also closely related to the length of the sequences under comparison. The worst case from the point of view of the signal deterioration occurs if two sequences of close lengths are compared; again, the genetic entities we use to illustrate the method allow one to bypass this problem at the moment. Thus, below we provide three typical situations of a search for homology in two genetic sequences: these are exact matching, subsec. 3.1; point mutations (point mismatches), subsec. 3.2 and insertion/deletion mismatch, subsec. 3.3. Yet, a combination of these three situations might bring some problems; thus, some further studies should address them. To show that all these three types of mismatches can be detected by the convolution technique, we use Aegilops speltoides mitochondrial genome (AC NC_022666.1 in GenBank). Three different transposons have been tested against the mitochondrial genome mentioned above. Fig. 1 illustrates the feasibility of the convolution based homology search for these transposons over the mitochondrial genome. In this figure, the left subfigures show the entire convolution trend; the right subfigures show the detailed pattern of the convolution variation at the site of homology. 3.1 Exact Matching Search Exact matching search is the classical problem of pure and applied mathematics. A lot has been done in this area. Figs. 1(a) and 1(b) show the case of exact matching search. Due to the low resolution of the picture (the convolution total length is about 455 000) a point diagram is used in the left part (i. e., for the total convolution trend); the red dot shows the location of the site exactly matching the transposon. Fig. 1(b) shows an inset with the convolution, so that only 200 values are shown which are presented in bars. Obviously, the jump in the convolution is evident, which proves the feasibility of the method. 3.2 Point Mutations Point mutation search has made great progress in the editing distance methodology [4, 5]. Probably, this type of mutations is the most suitable for alignment and relevant approaches. Figs. 1(c) and 1(d) show the result of the convolution based search of a 113 template with 18 point mismatches. Again, Fig. 1(c) shows the entire genome, and the red dot indicates the location of the highest homology to the transposon. Fig. 1(d) shows the detailed pattern of the convolution behavior; here the red bar represents the coincidence of the template and the site in the genome. 3.3 Insertion/Deletion This is the hardest mismatch type from the point of view of detection with regular alignment tools. Figs. 1(e) and 1(f) show the result of the convolution calculation to detect the transposon. Note that the location of the site corresponding to the transposon is absolutely the same, as in the case of the exact search (see Figs. 1(a) and 1(b)). It should be said that here we tested another transposon differing in two insertions. 4 Discussion and Conclusion Here, we present a new method of the homology search in symbol sequences for genetic applications. The method is based on the convolution calculation for digital sequences obtained from the symbol ones through special transformation. The method enables an innovative application of the well-known Fourier transform to dramatically speed up computations for important class of bioinformatics problems. The results shown above demostrates the feasibility and efficiency of the new method to search for homologies in extended genetic sequences. Certainly, the method could be applied for the analysis of sequences of any nature ranging from linguistics to financial time series. Meanwhile, some further improvements are expected: signal/noise discrimination and localization of the site with the homology are the most challenging ones among them. Also, it should be stressed that the method allows coarse grained parallelism: 1. the sequences P and Q where ν runs A, C, G and T could be treated simultaneously, and in parallel; 2. similarly, the greater is the number of the sequences to be compared pairwise, the faster could be the software implementation of the method; 3. parallelism grows up, as the capacity of an alphabet increases. Indeed, for amino acid sequences (that is very important in a number of applications), parallelism may accelerate the methods ten times and faster; 4. finally, very long biological sequences (up to 10 symbols) could be treated by parts, with the subsequent concatenation of the convolutions. The detailed discussion of these issues falls beyond the scope of this paper. Acknowledgments. This work is supported by the Krasnoyarsk Mathematical Center and financed by the Ministry of Science and Higher Education of the Russian Federation in the framework of the establishment and development of regional Centers for Mathematics Research and Education (Agreement No. 075-02-2020- 1631). 114 References 1. Kawam, A., Khatri, S., Datta, A.: A survey of software and hardware approaches to performing read alignment in next generatiob sequencing. IEEE/ACM transactions on computational biology and bioinfirmatics 14(6), 1202–1213 (2017) 2. Kaur, Y., Sohi, N.: Comparison of different sequence alignment methods – A survey. International Journal of Advanced Research in Computer Science 8(5) (2017) 3. Kulishina, Y.: Mobile elements in mitochondrion genomes of Poacea. Bachelor diploma. Siberian Federal University (June 2020) 4. Levenshtein, V.I.: On perfect codes in deletion and insertion metric. Discrete Mathematics and Applications 2(3), 241–258 (1992) 5. Miller, F., Vandome, A., McBrewster, J.: Levenshtein Distance. VDM Publishing (2009), https://books.google.ru/books?id=TTzhQgAACAAJ 6. Molyavko, A., Shaidurov, V., Karepova, E., Sadovsky, M.: Highly parallel convolution mathod to compare DNA sequences with enforced in/del and mutation tolerance. In: International Work-Conference on Bioinfirmatics and Biomedical Engineering. pp. 472– 481. Springer (2020) 7. Ng, H.C., Liu, S., Luk, W.: Reconfigurable acceleration of genetic sequence aignment: A survey of two decades of efforts. In: 2017 27th International Conference on Field Programmable Logic and Applications (FPL). pp. 1–8. IEEE (2017) 8. Wang, X.D., Liu, J.X., Xu, Y., Zhang, J.: A survey of multiple sequence alignment techniques. In: International Conference on Intelligent Computing. pp. 529–538. Springer (2015)