=Paper=
{{Paper
|id=Vol-2667/paper45
|storemode=property
|title=The method of generation barcode for DNA certification of plants and other organisms
|pdfUrl=https://ceur-ws.org/Vol-2667/paper45.pdf
|volume=Vol-2667
|authors=Olga Kiryanova,Ilya Kiryanov,Liana Akhmetzianova,Bulat Kuluev,Alexey Chemeris
}}
==The method of generation barcode for DNA certification of plants and other organisms==
Olga Kiryanova, Ufa State Petroleum Technological University, Ufa, Russia, olga.kiryanova27@gmail.com

Ilya Kiryanov, Corning, Inc., Saint Petersburg, Russia, ilya.lsc@gmail.com

Liana Akhmetzianova, Institute of Petrochemistry and Catalysis; Ufa Federal Research Center, RAS, Ufa, Russia, www.lianab@mail.ru

Bulat Kuluev, Institute of Biochemistry and Genetics; Ufa Federal Research Center, RAS, Ufa, Russia, kuluev@bk.ru

Alexey Chemeris, Institute of Biochemistry and Genetics; Ufa Federal Research Center, RAS, Ufa, Russia, chemeris@anrb.ru

Abstract: This paper presents a new DNA certification method for living organisms. The suggested approach is based on a unique barcode that identifies a particular organism. The studies were conducted using several species of crops and model plants (Solanum tuberosum, Triticum aestivum, Arabidopsis thaliana). A web-based application was developed on the basis of the proposed technique.

Keywords: polymerase chain reaction, primer design, DNA certification, barcode, web application

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020), Data Science.

===I. INTRODUCTION===
Polymerase chain reaction (PCR) is an experimental method of molecular biology that can significantly increase the quantity of target DNA fragments with specific nucleotide sequences in a sample [1]. PCR is widely used in biological and medical practice to isolate new genes, diagnose diseases, and for other tasks.

PCR was invented in the middle of the 1980s. Nowadays it is the leading method in the field of physical and chemical biology.

Primers (short DNA fragments consisting of 10-30 nucleotides) are important components that affect the success of experiments [2]. Primers in PCR must satisfy two main requirements: specificity of the amplification process and its efficiency. A pair of primers is usually used in PCR. However, in some cases a single primer may be sufficient, since it acts as the forward and the reverse primer simultaneously [3]. This single-primer approach is used for DNA polymorphism elucidation. For multiplex PCR, several primers can be used simultaneously, usually up to 12. Using more than one pair of oligonucleotide primers at the same time leads to the coamplification of DNA matrices, which results in multiple PCR products [4]. In this case primers can anneal in pairs in all possible combinations. An example of primer annealing in the multiplex PCR is shown in figure 1.

Fig. 1. An example of primer annealing in the multiplex PCR. Different primers are shown in different colors. Red brackets denote the amplicon sizes.

Amplicon sizes can be predicted from the known complete nucleotide sequence of the analyzed organism. This is a complicated task that cannot be done manually. For example, a genome with 1 billion nucleotide pairs has about 10<sup>3</sup> annealing sites for a decamer primer. To solve this problem a web-based application was developed. The proposed software determines the annealing positions of primers in the DNA chain and reports the length of the amplicons. Since the probability of obtaining identical results for different genomes is negligible, the obtained data can be represented as a unique barcode which, in turn, represents a digital DNA passport [5].

===II. PROBLEM DESCRIPTION===
The global efforts in the creation and promotion of new varieties of agricultural crops require the modernization of the selection process. Currently existing solutions for DNA certification of plants do not make it possible to obtain digitized data. The previously proposed barcode system is based on the polymorphism of specific genes (most often the cytochrome oxidase gene). Therefore, the detected degree of polymorphism is quite low and makes it possible to detect only the relationship of individual groups of organisms, as well as their location on the evolutionary tree [6-8]. Some recently diverged species may not be distinguishable based on the analysis of several genes. Modern instrumental methods for unambiguous genetic identification of biological material do not make it possible to determine the difference between plant varieties. The development of a well-reproducible and relatively inexpensive method of DNA certification of varieties and their DNA identification is an urgent task. Improved or new solutions for the above-mentioned problem could ensure significant economic growth in the agricultural sector of the economy.

For unambiguous certification and identification we propose a new approach: to assign unique genetic barcodes to plant varieties based on the DNA polymorphism detected using PCR. It does not require prior knowledge about the genome of any plant species. There are more than 20 methods for detecting DNA polymorphism in plants. However, none of them provides true digital data or has proper reproducibility [9-12]. The experimental basis of the DNA certification method is a modified PCR based on the RAPD (Random Amplified Polymorphic DNA) method. It is preferable to perform computer analysis before conducting the laboratory experiments. Such computer modeling helps to determine the possible annealing sites and the sizes of the reaction products (amplicons).

In order to determine the amplicon size in silico it is necessary to know the positions of the forward and reverse primers in a nucleotide sequence. After that the distance between these primers can be determined. That distance is called the amplicon size and must be from 51 to 500 nucleotides long, inclusive. This range is optimal for most cases of gel electrophoresis and sequencing [13]. A search example is shown in figure 2.

Fig. 2. An example of searching forward and reverse primers in a fragment of nucleotide sequence.

Following the above-mentioned logic, the proposed software collects information on all available occurrences of primers and on the amplicon lengths [14]. An example of the result is shown in table 1.

{| class="wikitable"
|+ TABLE I. Positions of annealing forward and reverse primers in the Solanum tuberosum genome (data is placed in ascending order of amplicon size). Example of output data.
! GGATCTTT position !! AAAGATCC position !! Amplicon size
|-
| 39883835 || 39884052 || 217
|-
| 55375264 || 55375548 || 284
|-
| 29569657 || 29569969 || 312
|-
| 38393029 || 38393375 || 346
|-
| 49519668 || 49520023 || 355
|-
| 41540764 || 41541163 || 399
|-
| 8231987 || 8232448 || 461
|}

Information about the genome is presented as a single file or a collection of files with text data according to the FASTA standard. This is the most common format for digital storage of nucleotide sequences. Nucleotide sequences are stored as strings of the characters “A”, “G”, “C”, “T” and sometimes “N”. Each letter denotes the corresponding nucleobase: adenine, guanine, cytosine, and thymine, respectively. “N” means an unknown nucleotide. The FASTA format allows easy manipulation of sequences using text editors and programming languages such as Python, Ruby, Perl, etc. That is why FASTA files are widely used for searching primer positions. Because a FASTA file stores the sequence as plain text, the above-mentioned task can be reduced to a well-known problem: substring search in a string.

There are several well-known algorithms for substring search in a string: linear search, the Knuth-Morris-Pratt algorithm, and the Boyer-Moore algorithm [15]. The Boyer-Moore algorithm is considered the fastest among general-purpose classical algorithms designed to find a substring in a string. The main advantage of the Boyer-Moore algorithm is that the shift is calculated based on the pattern (and not on the string in which the search is conducted). The pattern is compared with a fragment of the string from right to left. In addition, the search pattern is not compared with the source text at all positions; most of them are skipped as obviously unsuccessful. The computational complexity of the linear algorithm is O(nm), where m is the length of the search pattern and n is the length of the search string. The computational complexity of the Boyer-Moore and Knuth-Morris-Pratt algorithms is O(m+n) [16]. A comparative analysis of the efficiency of the algorithms was performed using a genome with 10<sup>6</sup> nucleotides. It was shown that the Boyer-Moore algorithm is more suitable for primer search [17]. Thus, the Boyer-Moore algorithm was used to implement the substring (primer) search. It should be mentioned that the presented search technique was implemented in the Python language with the Numba JIT compiler [18].

On the basis of the data about the amplicon lengths a barcode can be generated. The barcode is represented as a set of lines that indicate the presence of each amplicon length in the range from 51 to 500 nucleotides. We assume that this range includes 450 imaginary DNA cells, each of which may contain DNA (a DNA+-cell) or no DNA (a DNA--cell). The presence of one or more DNA fragments of the same size in a specific DNA+-cell is not important, since this is a qualitative rather than a quantitative analysis. Thus, the information about each sample can be presented as a sequence of zeros and ones over the selected range of lengths taken in the amplicon analysis. For example, consider the range from 101 to 110 nucleotides, where the detected DNA fragments have the following form: …101-, 102+, 103-, 104-, 105-, 106-, 107+, 108-, 109-, 110- …. The numbers denote the size of the DNA fragment in nucleotides, (+) denotes the presence of a DNA fragment, and (-) denotes its absence. In binary format the entry for this section is as follows: …0100001000.

Visually such data can be conveniently represented as genetic barcodes in a linear or two-dimensional display. For example, the barcode corresponding to the data in table 1 is shown in figure 3.

Fig. 3. Barcode example for Solanum tuberosum PCR reaction with forward primer GGATCTTT and reverse primer AAAGATCC.

The main advantage of the proposed approach is the easy comparison of two independent genetic characteristics. It is possible to accurately measure the amplicon length after its separation in capillary gel electrophoresis under denaturing conditions.

The obtained data about the primer(s), the analyzed genome, and the set of selected amplicons are unique. This eliminates accidental barcode coincidence between different samples of strains, races, varieties, breeds, or individuals, since the DNA fragments can be distributed over the DNA+-cells in a huge number of variants (combinations).

The total number of occupancy combinations in such DNA cells can be calculated as the number of combinations of m taken n at a time, using formula (1):

<math>C_m^n = \frac{m!}{n!\,(m-n)!} \qquad (1)</math>

where C is the total number of possible occupancy combinations in the DNA cells, m is the number of all DNA-cells analyzed in the selected range, and n is the number of all DNA+-cells.

According to probability theory, the largest number of combinations occurs when half of the cells are occupied with DNA fragments (225 of 450). In this case the number of combinations exceeds 10<sup>100</sup>. This number is more than enough for unambiguous DNA certification of any organism. The probability of a random match of two DNA samples with the number of different-sized amplicons equal to five is about one case per 10<sup>12</sup>. Thus, the proposed approach is an efficient method for DNA certification of cultivars, lines, breeds, and strains.

===III. ABCDNA_GS (AMPLIFIED BAR-CODED DNA GENOME/SPECIMEN)===
We have developed a web application with a database for storing information about the amplicons and for barcode generation.

The input data is: domain (Archaea, Prokaryotes, Eukaryotes); kingdom (Animals, Plants, Fungi), only for Eukaryotes; genome; primer(s); and type of DNA amplification (RAPD, ISSR, AFLP). Entire genomes of different organisms can be used as FASTA files, including those from the Ensembl Genomes resource (http://ensemblgenomes.org).

The output data is: the sizes of the found amplicons and the corresponding barcode.

As a result, the sizes of the found amplicons make it possible to estimate the outcome of any particular PCR experiment.

In other words, the obtained data makes it possible to plan a PCR experiment for any genome and, in addition, to compare experimentally obtained amplicons with those found by the program.

An example of the user interface is shown in figure 4.

Fig. 4. The program interface, an example of the input data.

In addition to computer analysis, it is possible to compare wet lab experiments (amplicons found in vitro) with the predicted PCR outcome by comparing two barcodes.

Thus, the generated information is a kind of digital passport for varieties, breeds, and strains of various organisms [19].

===IV. CONCLUSIONS===
We proposed a new approach for cataloging/certifying diverse groups of plants and other organisms. Unambiguous certification and identification are carried out by assigning unique genetic barcodes to plant varieties based on the detected DNA polymorphism. In addition, this method is applicable to all living organisms besides humans. Other approaches are used for DNA identification of an individual; among them, single-nucleotide DNA polymorphism is considered the most promising for barcode data. Currently, many approaches are used for DNA certification of plant varieties, but none of them provides unambiguous digital data. Thus, the suggested approach for DNA certification (cataloging)/identification of living organisms is unique. In addition, a web application was developed that detects the presence of specific primers in the DNA (genomes), determines the size of the amplicons that are formed as a result of PCR, and creates the corresponding unique barcode. In the future, it is planned to translate the data into QR codes and to use machine learning methods to classify barcodes and compare related varieties [20].

The web-based application makes it possible to catalog wet laboratory experiments and in silico analysis. The entire genomes of different organisms, including Solanum tuberosum, Triticum aestivum, and Arabidopsis thaliana, are available from the Ensembl Genomes resource (http://ensemblgenomes.org). Thus, without conducting a full-scale experiment it is possible to test several primers and to get an idea of the success of the full-scale experiment. Due to the uniqueness of the proposed approach it is possible to systematize data for different primers and DNA sequences without taking into account their natural affiliation. It was shown that barcoding can enhance genome comparison by excluding the human factor [21], makes it possible to get digital data about a certain genome, and leads to an intuitive and clear comparison with other digitized genomes.

===ACKNOWLEDGMENT===
This research was supported by the Russian Foundation for Basic Research (project No. 17-44-020120).

===REFERENCES===
[1] B. Glik and G. Pasternak, “Molecular biotechnology. Principles and application,” Moscow: Mir, 2002, 589 p.

[2] R.R. Garafutdinov, An.Kh. Baymiev, G.V. Maleev, Ya.I. Alekseev, V.V. Zubov, D.A. Chemeris, O.Yu. Kiryanova, I.M. Gubaydullin, R.T. Matniyazov, A.R. Sakhabutdinova, Yu.M. Nikonorov, B.R. Kuluev, Al.Kh. Baymiev and A.V. Chemeris, “Variety of PCR primers and principles of their selection,” Biomics, vol. 11, no. 1, pp. 23-70, 2019.

[3] B.R. Kuluev, An.Kh. Baymiev, G.A. Gerashchenkov, D.A. Chemeris, V.V. Zubov, A.R. Kuluev, Al.Kh. Baymiev and A.V. Chemeris, “Random priming PCR strategies for identification of multilocus DNA polymorphism in eukaryotes,” Russian Journal of Genetics, vol. 54, no. 5, pp. 499-513, 2018.

[4] J.S. Chamberlain, R.A. Gibbs, J.E. Ranier, P.N. Nguyen and C.T. Caskey, “Deletion screening of the Duchenne muscular dystrophy locus via multiplex DNA amplification,” Nucleic Acids Research, vol. 16, no. 23, pp. 11141-11156, 1988.

[5] “What is FASTA format?” [Online]. URL: https://zhanglab.ccmb.med.umich.edu/FASTA/.

[6] J. Huang, Q. Xu, Z.J. Sun, G.L. Tang and Z.Y. Su, “Identifying earthworms through DNA barcodes,” Pedobiologia, no. 51, pp. 301-309, 2007.

[7] A. Cywinska, S.L. Ball and J.R. deWaard, “Biological identifications through DNA barcodes,” Proc. R. Soc. Lond. B Biol. Sci., vol. 270, pp. 313-321, 2003.

[8] P.D.N. Hebert, S. Ratnasingham and J.R. deWaard, “Barcoding animal life: cytochrome c oxidase I divergences among closely related species,” Proc. R. Soc. Lond. B Biol. Sci., vol. 270, pp. 596-599, 2003. DOI: 10.1016/j.pedobi.2007.05.003.

[9] H. Nybom, K. Weising and B. Rotter, “DNA fingerprinting in botany: past, present, future,” Investigative genetics, vol. 5, no. 1, 2014.

[10] K.N. Babu, M.K. Rajesh, K. Samsudeen, D. Minoo, E.J. Suraby, K. Anupama and P. Ritto, “Randomly amplified polymorphic DNA (RAPD) and derived techniques,” Methods Mol Biol., vol. 1115, pp. 191-209, 2014. DOI: 10.1007/978-1-62703-767-9_10.

[11] N. Jones, H. Ougham, H. Thomas and I. Pasakinskiene, “Markers and mapping revisited: finding your gene,” New Phytol., vol. 183, no. 4, pp. 935-966, 2009. DOI: 10.1111/j.1469-8137.2009.02933.x.

[12] P. Poczai, I. Varga, M. Laos, A. Cseh, N. Bell, J. P. Valkonen and J. Hyvönen, “Advances in plant gene-targeted and functional markers: a review,” Plant Methods, vol. 9, no. 1, 2013. DOI: 10.1186/1746-4811-9-6.

[13] O.Yu. Kiryanova and A.V. Chemeris, “Modeling the search for primers in the DNA chain,” Materials of the V International conference on information technology and nanotechnology ITNT, Samara, Russia, pp. 774-778, 2019.

[14] O.Yu. Kiryanova, L.U. Akhmetzianova, B.R. Kuluev and I.M. Gubaydullin, “Program for searching primers for polymerase chain reaction,” Materials of the XIII Russian scientific Internet conference Integration of science and higher education in the field of bio- and organic chemistry and biotechnology, Ufa, Russia, pp. 153-154, 2019.

[15] O.Yu. Kiryanova, L.U. Akhmetzianova and I.M. Gubaydullin, “Search algorithms in the analysis of nucleotide sequences for unambiguous identification of genomes,” Bulletin of Bashkir University, vol. 25, no. 2, pp. 285-289. DOI: 10.33184/bulletin-bsu-2020.2.10.

[16] D. Gusfield, “Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology,” Cambridge University Press, 2003, 654 p.

[17] T.H. Cormen, Ch.E. Leiserson, R.L. Rivest and K. Stein, “Algorithms: construction and analysis,” M: Williams, 2005, 801 p.

[18] O.Yu. Kiryanova, I.I. Kiryanov, L.U. Akhmetzianova, B.R. Kuluev, A.V. Chemeris and I.M. Gubaydullin, “Parallel implementation of search algorithm for RNA guide design,” Materials of International conference Parallel Computational Technologies (PCT), pp. 52-58, 2020.

[19] O.Yu. Kiryanova, I.I. Kiryanov, B.R. Kuluev, A.V. Chemeris, R.R. Garafutdinov and I.M. Gubaydullin, “ABCDNA_GS (Amplified Bar-Coded DNA Genome/Specimen)” [Online]. URL: https://www.fips.ru/registers-doc-view/fips_servlet?DB=EVM&DocNumber=2020610703&TypeFile=html.

[20] V.V. Arlazarov, K. Bulatov, T. Chernov and V.L. Arlazarov, “MIDV-500: a dataset for identity document analysis and recognition on mobile devices in video stream,” Computer Optics, vol. 43, no. 5, pp. 818-824, 2019. DOI: 10.18287/2412-6179-2019-43-5-818-824.

[21] B. Jiang, Y. Zhao, H. Yi, Y. Huo, H. Wu, J. Ren, J. Ge, J. Zhao and F. Wang, “PIDS: A User-Friendly Plant DNA Fingerprint Database Management System,” Genes, vol. 11, no. 4, pp. 373, 2020.
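
The in silico procedure of Section II (primer search, amplicon sizing, and barcode generation) can be illustrated with a minimal Python sketch. It is not taken from the ABCDNA_GS code base: the function names (`reverse_complement`, `find_sites`, `amplicon_sizes`, `barcode`) are hypothetical, Python's built-in `str.find` stands in for the Boyer-Moore search, and the amplicon size is assumed to be the difference between the two annealing positions, as the rows of Table I suggest.

```python
import math

# Amplicon size range used for the barcode (Section II): 450 "DNA cells".
MIN_LEN, MAX_LEN = 51, 500


def reverse_complement(primer):
    """Return the reverse complement of a primer, e.g. GGATCTTT -> AAAGATCC."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(primer))


def find_sites(sequence, pattern):
    """Return 0-based start positions of all (possibly overlapping) matches.

    Assumption: str.find is a stand-in here; the paper reports using the
    Boyer-Moore algorithm compiled with Numba.
    """
    sites = []
    pos = sequence.find(pattern)
    while pos != -1:
        sites.append(pos)
        pos = sequence.find(pattern, pos + 1)
    return sites


def amplicon_sizes(sequence, forward, reverse_site):
    """Return sorted distinct amplicon sizes within [MIN_LEN, MAX_LEN].

    Assumption: size = reverse position - forward position, as in Table I.
    The nested loop is quadratic in the number of sites and is kept only
    for clarity.
    """
    forward_sites = find_sites(sequence, forward)
    reverse_sites = find_sites(sequence, reverse_site)
    sizes = set()
    for f in forward_sites:
        for r in reverse_sites:
            size = r - f
            if MIN_LEN <= size <= MAX_LEN:
                sizes.add(size)
    return sorted(sizes)


def barcode(sizes):
    """Encode amplicon sizes as a 450-character string of '0' and '1'.

    Character i corresponds to length MIN_LEN + i; '1' marks a DNA+-cell.
    """
    present = set(sizes)
    return "".join(
        "1" if length in present else "0"
        for length in range(MIN_LEN, MAX_LEN + 1)
    )


if __name__ == "__main__":
    # Amplicon sizes from Table I (Solanum tuberosum, primer GGATCTTT).
    table_sizes = [217, 284, 312, 346, 355, 399, 461]
    bits = barcode(table_sizes)
    print(len(bits), bits.count("1"))

    # Formula (1) with m = 450 cells and n = 225 occupied cells
    # (math.comb requires Python 3.8 or later).
    print(math.comb(450, 225) > 10**100)
```

For a real genome the sequence string would be read from a FASTA file with the header lines removed. Two barcodes produced this way can be compared position by position, which corresponds to the comparison of predicted and in vitro results described in Section III.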