=Paper=
{{Paper
|id=Vol-2667/paper45
|storemode=property
|title=The method of generation barcode for DNA certification of plants and other organisms 
|pdfUrl=https://ceur-ws.org/Vol-2667/paper45.pdf
|volume=Vol-2667
|authors=Olga Kiryanova,Ilya Kiryanov,Liana Akhmetzianova,Bulat Kuluev,Alexey Chemeris
}}
==The method of generation barcode for DNA certification of plants and other organisms ==
<pdf width="1500px">https://ceur-ws.org/Vol-2667/paper45.pdf</pdf>
<pre>
            The method of generation barcode for DNA
             certification of plants and other organisms
           Olga Kiryanova                                           Ilya Kiryanov                               Liana Akhmetzianova
  Ufa State Petroleum Technological                                  Corning, Inc.                         Institute of Petrochemistry and
              University                                       Saint Petersburg, Russia                                Catalysis;
             Ufa, Russia                                         ilya.lsc@gmail.com                       Ufa Federal Research Center, RAS
    olga.kiryanova27@gmail.com                                                                                        Ufa, Russia
                                                                                                                www.lianab@mail.ru
              Bulat Kuluev                                         Alexey Chemeris
Institute of Biochemistry and Genetics;                Institute of Biochemistry and Genetics;
  Ufa Federal Research Center, RAS                       Ufa Federal Research Center, RAS
               Ufa, Russia                                            Ufa, Russia
              kuluev@bk.ru                                        chemeris@anrb.ru

    Abstract—This paper presents a new DNA certification                        Different primers are shown in different colors. Red
method for living organisms. The suggested approach is                       brackets denote the amplicon sizes.
based on a unique barcode that identifies a particular
organism. The studies were conducted using several species                      It is possible to predict amplicon sizes on the basis of
of crops and model plants (Solanum tuberosum, Triticum                       the known complete nucleotide sequence of the analyzed
aestivum, Arabidopsis thaliana). A web-based application                     organism. This is a complicated task that cannot be done
was developed on the basis of the proposed technique.                        manually. For example, a genome with 1 billion nucleotide
                                                                             pairs has about 10^3 annealing sites for decamer primers.
                                                                             To solve this problem a web-based application was
    Keywords—polymerase chain reaction, primer design, DNA                   developed. The proposed software determines the annealing
certification, barcode, web application                                      positions of primers in the DNA chain and the lengths of
                                                                             the resulting amplicons. Since the
                         I. INTRODUCTION                                     probability of obtaining identical results for different
    Polymerase chain reaction (PCR) is an experimental                       genomes is negligible, the obtained data could be represented
method of molecular biology that can significantly increase                  as a unique barcode which, in turn, represents a digital
the quantity of target DNA fragments with specific                           DNA passport [5].
nucleotide sequences in a sample [1]. PCR is widely used in
biological and medical practice to isolate new genes,                                            II. PROBLEM DESCRIPTION
diagnose diseases and for other tasks.                                           The global efforts in creation and promotion of new
    PCR was invented in the middle of the 1980s. Nowadays it                 varieties of agricultural crops require the modernization of
is the leading method in the field of physical and chemical                  the selection process. Currently existing solutions for DNA
biology.                                                                     certification of plants do not provide digitized data.
                                                                             The barcode systems proposed to date rely on the polymorphism
    Primers (short DNA fragments consisting of 10-30                         of specific genes (most often the cytochrome oxidase gene).
nucleotides) are important components that affect the success                Therefore, the detected degree of polymorphism is quite low
of experiments [2]. Primers in PCR must satisfy two main                     and reveals only the relationships between individual
requirements: specificity of the amplification process and                   groups of organisms and their location on the
its efficiency. A pair of primers is usually used in PCR.                    evolutionary tree [6-8]. Some recently diverged species may
However, in some cases a single primer may be sufficient                     not be distinguishable based on analysis of several genes.
since it acts as both the forward and the reverse primer                     Modern instrumental methods for unambiguous genetic
[3]. Such a single-primer approach is used to reveal DNA                     identification of biological material cannot distinguish
polymorphism. For multiplex PCR, several primers can be                      between plant varieties. The development of a
used simultaneously, usually up to 12. Using more than one                   well-reproducible and relatively inexpensive method of DNA
pair of oligonucleotide primers at the same time leads to                    certification of varieties and their DNA identification is
the coamplification of DNA templates, which results in                       an urgent task. Improved or new solutions to this problem
multiple PCR products [4]. In this case primers can anneal                   could ensure significant economic growth in the
in pairs in all possible combinations. An example of                         agricultural sector of the economy.
primer annealing in multiplex PCR is shown in Fig. 1.

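
The order of magnitude of the annealing-site estimate for decamer primers in a genome of 1 billion nucleotide pairs can be checked with a few lines of Python. This is only a sketch: it assumes a random sequence in which the four bases are equally likely and independent, counts a single strand, and uses the illustrative name `expected_sites`, which is not part of the described software.

```python
def expected_sites(genome_length: int, primer_length: int) -> float:
    """Expected number of exact matches of one primer on one strand.

    Assumption (not from the paper): bases are independent and
    equally frequent, so each position matches with probability
    (1/4) ** primer_length.
    """
    positions = genome_length - primer_length + 1  # possible start positions
    return positions / 4 ** primer_length


# Genome of 10^9 nucleotide pairs, decamer (10-nucleotide) primer.
estimate = expected_sites(10**9, 10)
print(f"about {estimate:.0f} expected annealing sites")
```

Under these assumptions the result is on the order of 10^3, far too many sites to pair up by hand, which is the motivation for automating the search.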
                                                                                 For unambiguous certification and identification we
                                                                             proposed a new approach: to assign unique genetic barcodes
                                                                             to plant varieties based on the detected DNA polymorphism
                                                                             using PCR. It does not require prior knowledge about
                                                                             genome of any plant species.
                                                                                 There are more than 20 methods for detecting DNA
                                                                             polymorphism in plants. However, none of them provides
Fig. 1. An example of primer annealing in the multiplex PCR.                 true digital data or has proper
                                                                             reproducibility [9-12]. The experimental basis of the DNA
                                                                             certification method is a modified PCR based on the RAPD -


Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)
Data Science

Random Amplified Polymorphic DNA amplification                            algorithm and the Boyer-Moore algorithm [15]. The
method. It is preferable to perform computer analysis before              Boyer-Moore algorithm is considered the fastest among
conducting laboratory experiments. Such computer                          general-purpose classical algorithms designed to find a
modeling helps to determine the possible annealing                        substring in a string. Its main advantage is that the shift is
sites and the sizes of the reaction products (amplicons).                 calculated from the pattern (not from the text being
                                                                          searched). The pattern is compared with a fragment of the
    In order to determine the amplicon size in silico it is               string from right to left. In addition, the pattern is not
necessary to know the positions of the forward and reverse                compared with the source text at every position: most
primers in a nucleotide sequence. The distance between                    positions are skipped as obviously unsuccessful. The
these primers can then be determined. That distance is                    worst-case time complexity of the linear algorithm is
called the amplicon size and must be from 51 to 500                       O(nm), where m is the length of the search pattern and n
nucleotides inclusive. This range is optimal for most cases               is the length of the search string. The complexity of the
of gel electrophoresis and sequencing [13]. A search                      Boyer-Moore and Knuth-Morris-Pratt algorithms is
example is shown in Fig. 2.                                               estimated as O(m+n) [16].
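
The right-to-left comparison and the pattern-derived shift described above can be illustrated with the Boyer-Moore-Horspool variant, a common simplification of Boyer-Moore that keeps only the bad-character shift. The sketch below is not the authors' implementation (which is reported to use Numba); the function name `horspool_find_all` and the test sequence are illustrative.

```python
def horspool_find_all(text: str, pattern: str) -> list[int]:
    """Return start positions of all occurrences of pattern in text
    using the Boyer-Moore-Horspool bad-character shift."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift table built from the pattern only: for each character
    # (except the last one), the distance from its rightmost
    # occurrence to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        # Compare the pattern with the current window from right to left.
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            hits.append(pos)
        # Skip ahead according to the last character of the window;
        # characters absent from the table allow a full-length jump.
        pos += shift.get(text[pos + m - 1], m)
    return hits


sequence = "GGATCTTTACGGGATCTTT"
print(horspool_find_all(sequence, "GGATCTTT"))
```

Because the shift table depends only on the primer, it is built once and reused for every chromosome or FASTA record that is scanned.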
                                                                              A comparative analysis of the algorithms' efficiency was
                                                                          performed using a genome of 10^6 nucleotides. It showed
                                                                          that the Boyer-Moore algorithm is more suitable for primer
                                                                          search [17]. Thus, the Boyer-Moore algorithm was used to
                                                                          implement the substring (primer) search.
                                                                              The presented search technique was implemented in
                                                                          Python with the Numba JIT compiler [18].

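
Once the primer positions are known, the amplicon sizes follow from pairing each forward site with the reverse sites that lie 51 to 500 nucleotides downstream. The following minimal sketch assumes that the amplicon size is taken as the difference between the two annealing positions, as the rows of Table I suggest; the helper names `find_all` and `amplicon_sizes` are hypothetical, and the built-in `str.find` stands in for the Boyer-Moore search.

```python
from bisect import bisect_left, bisect_right


def find_all(text: str, pattern: str) -> list[int]:
    """All start positions of pattern in text, in ascending order."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def amplicon_sizes(sequence, forward, reverse, lo=51, hi=500):
    """Return (forward_pos, reverse_pos, size) tuples sorted by size.

    Assumption: size = reverse_pos - forward_pos, and only sizes in
    the inclusive range lo..hi are kept.
    """
    fwd = find_all(sequence, forward)
    rev = find_all(sequence, reverse)  # already sorted ascending
    result = []
    for f in fwd:
        # Slice of reverse sites located lo..hi nucleotides after f.
        first = bisect_left(rev, f + lo)
        last = bisect_right(rev, f + hi)
        for r in rev[first:last]:
            result.append((f, r, r - f))
    return sorted(result, key=lambda item: item[2])


# Toy sequence: forward site at 0, reverse site 100 nucleotides later.
toy = "GGATCTTT" + "C" * 92 + "AAAGATCC"
print(amplicon_sizes(toy, "GGATCTTT", "AAAGATCC"))
```

In this toy case the reverse primer AAAGATCC is the reverse complement of the forward primer GGATCTTT, which corresponds to the single-primer (RAPD-like) situation.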
Fig. 2.    An example of searching forward and reverse primers in a
fragment of nucleotide sequence.                                              On the basis of the amplicon lengths a barcode can be
                                                                          generated. The barcode is represented as a set of lines that
    Following the above-mentioned logic, the proposed                     mark which amplicon lengths in the range from 51 to 500
software collects information on all occurrences of                       nucleotides are present. We assume that this range
primers and on amplicon lengths [14]. An example of the                   comprises 450 imaginary DNA cells, each of which either
result is shown in Table I.                                               contains DNA (a DNA+ cell) or does not (a DNA- cell).
                                                                          Whether a specific DNA+ cell holds one or several DNA
TABLE I. POSITIONS OF ANNEALING FORWARD AND REVERSE PRIMERS IN            fragments of the same size is not important, since the
           THE SOLANUM TUBEROSUM GENOME (DATA IS PLACED IN                analysis is qualitative rather than quantitative. Thus, the
         ASCENDING ORDER OF AMPLICON SIZE). EXAMPLE OF OUTPUT             information about each sample can be written as a sequence
                                DATA.                                     of zeros and ones over the selected range of amplicon
      GGATCTTT            AAAGATCC            Amplicon                    lengths. For example, consider the range from 101 to 110
      position            position            size                        nucleotides with the following findings: …101-, 102+,
      39883835            39884052            217                         103-, 104-, 105-, 106-, 107+, 108-, 109-, 110- …. The
      55375264            55375548            284                         numbers denote the size of the DNA fragment in nucleotides,
      29569657            29569969            312                         (+) the presence of a DNA fragment, and (-) its absence.
      38393029            38393375            346                         In binary format the entry for this section is as follows:
      49519668            49520023            355                         …0100001000.
      41540764            41541163            399                             Visually, such data can be conveniently represented as
      8231987             8232448             461                         genetic barcodes in a linear or two-dimensional display. For
                                                                          example, the barcode corresponding to the data in Table I
                                                                          is shown in Fig. 3.
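
The mapping from amplicon sizes to a binary barcode can be sketched as follows. The function name `barcode_bits` and the text rendering are illustrative assumptions; only the 51 to 500 range, the 101 to 110 example, and the amplicon sizes of Table I come from the text.

```python
MIN_LEN, MAX_LEN = 51, 500  # amplicon length range used in the text


def barcode_bits(amplicon_sizes, lo=MIN_LEN, hi=MAX_LEN):
    """One character per DNA cell: '1' for a DNA+ cell, '0' for a DNA- cell.

    Repeated sizes collapse into a single '1' because the analysis
    is qualitative, not quantitative.
    """
    present = {size for size in amplicon_sizes if lo <= size <= hi}
    return "".join("1" if length in present else "0"
                   for length in range(lo, hi + 1))


# Fragment from the text: fragments of 102 and 107 nucleotides
# within the range 101..110.
print(barcode_bits([102, 107], lo=101, hi=110))

# Amplicon sizes listed in Table I for Solanum tuberosum.
table_1_sizes = [217, 284, 312, 346, 355, 399, 461]
bits = barcode_bits(table_1_sizes)

# Crude linear rendering: a bar for each DNA+ cell, a gap otherwise.
print(bits.replace("1", "|").replace("0", " "))
```

Two barcodes produced this way can be compared position by position, for example to set a predicted PCR outcome against a wet-lab result.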
    Information about a genome is presented as a single file
or a collection of files with text data in the FASTA format.
This is the most common format for digital storage of
nucleotide sequences. Nucleotide sequences are stored as
strings of characters “A”, “G”, “C”, “T” and sometimes “N”.               Fig. 3. Barcode example for Solanum tuberosum PCR reaction with forward
The letters denote the nucleobases adenine, guanine,                      primer GGATCTTT and reverse primer AAAGATCC.
cytosine, and thymine, respectively; “N” denotes an
unknown nucleotide. The FASTA format allows easy                             The main advantage of the proposed approach is easy
manipulation of sequences using text editors and                          comparison of two independent genetic characteristics. It is
programming languages such as Python, Ruby, Perl, etc.                    possible to accurately measure the amplicon length after its
That is why FASTA files are widely used for primer                        separation in capillary gel electrophoresis under denaturing
position search. Given the FASTA file format, this task                   conditions.
reduces to a well-known problem: substring search in a                        The obtained data about the primer(s), the analyzed
string.                                                                   genome, and the set of selected amplicons are unique. This
                                                                          completely eliminates the accidental barcode coincidence
    There are several well-known algorithms for substring                 of different samples of strains, races, varieties, breeds, or
search in a string: linear search, Knuth-Morris-Pratt                     individuals, since the amplicons can have a huge number of


VI International Conference on "Information Technology and Nanotechnology" (ITNT-2020)                                                       208

variants (combinations) of the distribution of these DNA                     In addition to computer analysis, wet lab experiments
fragments over DNA+ cells.                                                (amplicons found in vitro) can be compared with the
    The total number of possible occupancy combinations of                predicted PCR outcome by comparing two barcodes.
such DNA cells can be calculated as the number of                            Thus, the generated information is a kind of digital
combinations of n out of m using formula (1):                             passport for varieties, breeds, and strains of various
                                                                          organisms [19].
    $$C_m^n = \frac{m!}{n!\,(m-n)!} \qquad (1)$$
                                                                                              IV. CONCLUSIONS
where C is the total number of possible occupancy                             We proposed a new approach for cataloging/certifying
combinations of the DNA cells, m is the number of all DNA                 diverse groups of plants and other organisms. Unambiguous
cells analyzed in the selected range, and n is the number of              certification and identification are carried out by
DNA+ cells.                                                               assigning unique genetic barcodes to plant varieties based
    According to probability theory, the largest number of                on the detected DNA polymorphism. This method is applicable
combinations occurs when half of the cells are occupied                   to all living organisms except humans. Other methods are
with DNA fragments (225 of 450). In this case the number                  used for DNA identification of an individual; among these
of combinations exceeds 10^100. This number is more than                  approaches, the most promising for a data barcode is
enough for unambiguous DNA certification of any                           considered to be single-nucleotide DNA polymorphism.
organism. The probability of a random match of two DNA                    Currently, many approaches are used for DNA certification
samples with the number of different-sized amplicons equal                of plant varieties, but none of them provides unambiguous
to five is about one case per 10^12. Thus, the proposed                   digital data. Thus, the suggested approach for DNA
approach is an efficient method for DNA certification of                  certification (cataloging)/identification of living
cultivars, lines, breeds, and strains.                                    organisms is unique. In addition, a web application was
                                                                          developed that can detect the presence of specific primers
                                                                          in the DNA
                                                                          (genomes), determine the size of amplicons that are formed
     III. ABCDNA_GS (AMPLIFIED BAR-CODED DNA                              as a result of PCR, and create the corresponding unique
                 GENOME/SPECIMEN)                                         barcode. In the future, it is planned to translate the data
    We have developed a web application with a database for               into QR codes and use machine learning methods to classify
storing information about amplicons and for barcode                       barcodes and compare related varieties [20].
generation.                                                                   The web-based application makes it possible to catalog
    The input data are: domain (Archaea, Prokaryotes,                     wet laboratory experiments and in silico analyses. Entire
Eukaryotes); kingdom (Animals, Plants, Fungi), only for                   genomes of different organisms, including Solanum
Eukaryotes; genome; primer(s); and type of DNA                            tuberosum, Triticum aestivum, and Arabidopsis thaliana,
amplification (RAPD, ISSR, AFLP). Entire genomes of                       are available from the Ensembl Genomes resource
different organisms, including those from the Ensembl                     (http://ensemblgenomes.org). Thus, without conducting a
Genomes resource (http://ensemblgenomes.org), are                         full-scale experiment it is possible to test several
used as FASTA files.                                                      primers and to get an idea of whether the full-scale
    The output data are the sizes of the amplicons found and              experiment will succeed. Due to the uniqueness of the
the corresponding barcode.                                                proposed approach it is possible to systematize data for
    The amplicon sizes found make it possible to estimate                 different primers and DNA sequences without taking into
the outcome of any particular PCR experiment.                             account their natural affiliation. It was shown that
    In other words, the obtained data make it possible to                 barcoding can enhance genome comparison by excluding the
plan a PCR experiment for any genome and to compare                       human factor [21], yields digital data about a certain
experimentally obtained amplicons with those found by the                 genome, and leads to an intuitive and clear comparison
program.                                                                  with other digitized genomes.
    An example of the user interface is shown in Fig. 4.
                                                                                              ACKNOWLEDGMENT
                                                                              This research was supported by the Russian Foundation
                                                                          for Basic Research (project № 17-44-020120).
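
The combinatorial estimate behind formula (1), namely that 450 DNA cells with 225 of them occupied give more than 10^100 combinations, can be verified directly with Python's standard library. The function name `occupancy_combinations` is illustrative; `math.comb` requires Python 3.8 or later.

```python
import math


def occupancy_combinations(m: int, n: int) -> int:
    """Number of ways to choose n DNA+ cells out of m cells, formula (1)."""
    return math.comb(m, n)


total = occupancy_combinations(450, 225)

# Cross-check against the factorial form m! / (n! (m - n)!).
assert total == math.factorial(450) // (math.factorial(225) ** 2)

# The count is largest when half of the cells are occupied.
best_n = max(range(451), key=lambda n: occupancy_combinations(450, n))

print(f"C(450, 225) has {len(str(total))} decimal digits; maximum at n = {best_n}")
print(total > 10**100)
```

Python integers have arbitrary precision, so the value is computed exactly rather than approximated.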
                                                                                                         REFERENCES
                                                                          [1]   B. Glik and G. Pasternak, “Molecular biotechnology. Principles and
                                                                                application,” Moscow: Mir, 2002, 589 p.
                                                                          [2]   R.R. Garafutdinov, Аn.Kh. Baymiev, G.V. Maleev, Ya.I. Alekseev,
                                                                                V.V. Zubov, D.A. Chemeris, O.Yu. Kiryanova, I.М. Gubaydullin,
                                                                                R.T. Matniyazov, A.R. Sakhabutdinova, Yu.M. Nikonorov,
                                                                                B.R. Kuluev, Аl.Kh. Baymiev and A.V. Chemeris, “Variety of PCR
                                                                                primers and principles of their selection,” Biomics, vol. 11, no. 1, pp.
                                                                                23-70, 2019.
                                                                          [3]   B.R. Kuluev, An.Kh. Baymiev, G.A. Gerashchenkov, D.A. Chemeris,
                                                                                V.V. Zubov, A.R. Kuluev, Al.Kh. Baymiev and A.V. Chemeris,
                                                                                “Random priming PCR strategies for identification of multilocus
                                                                                DNA polymorphism in eukaryotes,” Russian Journal of Genetics, vol.
                                                                                54, no. 5, pp. 499-513, 2018.
                                                                          [4]   J.S. Chamberlain, R.A. Gibbs, J.E. Ranier, P.N. Nguyen and
                                                                                C.T. Caskey, “Deletion screening of the Duchenne muscular
                                                                                dystrophy locus via multiplex DNA amplification,” Nucleic Acids
                                                                                Research, vol. 16, no. 23, pp. 11141-11156, 1988.
Fig. 4. The program interface, an example of the input data.



[5]  “What is FASTA format?” [Online]. URL: https://zhanglab.ccmb.                reaction,” Materials of the XIII Russian scientific Internet conference
     med.umich.edu/FASTA/.                                                        Integration of science and higher education in the field of bio-and
[6] J. Huang, Q. Xu, Z.J. Sun, G.L. Tang and Z.Y. Su, “Identifying                organic chemistry and biotechnology, Ufa, Russia, pp. 153-154, 2019.
     earthworms through DNA barcodes,” Pedobiologia, no. 51, pp. 301-        [15] O.Yu. Kiryanova, L.U. Akhmetzianova and I.M. Gubaydullin,
     309, 2007.                                                                   “Search algorithms in the analysis of nucleotide sequences for
[7] A. Cywinska, S.L. Ball and J.R. deWaard, “Biological identifications          unambiguous identification of genomes,” Bulletin of Bashkir
     through DNA barcodes,” Proc. R. Soc. Lond. B Biol. Sci., vol. 270,           unambiguous identification of genomes,” Bulletin of Bashkir
     pp. 313-321, 2003.                                                           University, vol. 25, no. 2, pp. 285-289. DOI: 10.33184/bulletin-bsu-2020.2.10.
[8] P.D.N. Hebert, S. Ratnasingham and J.R. deWaard, “Barcoding              [16] D. Gusfield, “Algorithms on Strings, Trees and Sequences: Computer
     animal life: cytochrome c oxidase I divergences among closely related        Science and Computational Biology,” Cambridge University Press,
     species,” Proc. R. Soc. Lond. B Biol. Sci., vol. 270, pp. 596-599,           2003, 654 p.
     2003. DOI: 10.1016/j.pedobi.2007.05.003.                                [17] T.H. Cormen, Ch.Е. Leiserson, R.L. Rivest and K. Stein,
[9] H. Nybom, K. Weising and B. Rotter, “DNA fingerprinting in botany:            “Algorithms: construction and analysis,” M: Williams, 2005, 801 p.
     past, present, future,” Investigative genetics, vol. 5, no. 1, 2014.    [18] O.Yu. Kiryanova, I.I. Kiryanov, L.U. Akhmetzianova, B.R. Kuluev,
[10] K.N. Babu, M.K. Rajesh, K. Samsudeen, D. Minoo, E.J. Suraby, K.              A.V. Chemeris and I.M. Gubaydullin, “Parallel implementation of
     Anupama and P. Ritto, “Randomly amplified polymorphic DNA                    search       algorithm        for     RNA          guide       design,”
     (RAPD) and derived techniques,” Methods Mol Biol., vol. 1115, pp.            Materials of International conference Parallel Computational
     191-209, 2014. DOI: 10.1007/978-1-62703-767-9_10.                            Technologies (PCT), pp. 52-58, 2020.
[11] N. Jones, H. Ougham, H. Thomas and I. Pasakinskiene, “Markers and       [19] O.Yu. Kiryanova, I.I. Kiryanov, B.R. Kuluev, A.V. Chemeris,
     mapping revisited: finding your gene,” New Phytol., vol. 183, no. 4,         R.R. Garafutdinov and I.M. Gubaydullin, “ABCDNA_GS
     pp. 935-966, 2009. DOI: 10.1111/j.1469-8137.2009.02933.x.                    (Amplifaied Bar-Coded DNA Genome/Specimen)” [Online]. URL:
                                                                                  https://www.fips.ru/registers-doc-view/fips_servlet?DB=EVM&
[12] P. Poczai, I. Varga, M. Laos, A. Cseh, N. Bell, J. P. Valkonen and J.        DocNumber=2020610703&TypeFile=html.
     Hyvönen, “Advances in plant gene-targeted and functional markers: a
     review,” Plant Methods, vol. 9, no. 1, 2013. DOI: 10.1186/1746-         [20] V.V. Arlazarov, K. Bulatov, T. Chernov and V.L. Arlazarov, “MIDV-
     4811-9-6.                                                                    500: a dataset for identity document analysis and recognition on
                                                                                  mobile devices in video stream,” Computer Optics, vol. 43, no. 5,
[13] O.Yu. Kiryanova and A.V. Chemeris, “Modeling the search for                  pp. 818-824, 2019. DOI: 10.18287/2412-6179-2019-43-5-818-824.
     primers in the DNA chain,” Materials of the V International
     conference on information technology and nanotechnology ITNT,           [21] B. Jiang, Y. Zhao, H. Yi, Y. Huo, H. Wu, J. Ren, J. Ge, J. Zhao and F.
     Samara, Russia, pp. 774-778, 2019.                                           Wang, “PIDS: A User-Friendly Plant DNA Fingerprint
                                                                                  Database Management System,” Genes, vol. 11, no. 4, pp. 373, 2020.
[14] O.Yu. Kiryanova, L.U. Akhmetzianova, B.R. Kuluev and I.M.
     Gubaydullin, “Program for searching primers for polymerase chain



</pre>