Machine Learning Applications for Genomic Pattern
                    Recognition Problem*

                                Elen Tevanyan and Maria Poptsova

                          National Research University Higher School of Economics,
                                Myasnitskya str. 20, 101000, Moscow, Russia
                              etevanian@hse.ru; mpoptsova@hse.ru


          Abstract. DNA secondary structures are important functional elements that
          may influence cellular processes. One of their possible functions is regulation
          of nucleosome positioning. Here MNAse-seq and ssDNA-seq data were used to
          define patterns of positional relationship of DNA structures such as Z-DNA, H-
          DNA and G-quadruplexes with nucleosomes. Three types of patterns were
          found: a structure is surrounded by nucleosomes from both sides, from one side,
          or nucleosome free region. Machine-learning models based on Random forest
          algorithm and XGBoost were trained to recognize DNA region of 500 bp length
          containing a pattern of nucleosome positioning for three types of DNA struc-
          tures (Z-DNA, H-DNA and G-quadruplexes) based on DNA sequence composi-
          tional properties. The best performance (more than 86% for ROC-AUC, accu-
          racy, recall and presicion scores) was reached for G-quadruplexes. 500 bp re-
          gions containing G-quadruplexes have distinct compositional properties and
          point to the preferential locations of the defined patterns, which regulatory
          functions require further investigation. For other DNA structures a region com-
          position is less powerful predictive factor and one should take into account oth-
          er physical and structural DNA properties to improve nucleosome-DNA-
          structure pattern recognition.

          Keywords: DNA structures, nucleosome positioning, machine-learning meth-
          ods, random forest, xgboost


1         Introduction

Machine learning is widely applied to problems in genomic research1. Computational
methods successfully annotate genomes with functional elements, such as transcrip-
tion start sites [1], splice-sites [2], alternative splicing [3], promoters, enhancers [4].
Recent advancements in computational performance enable to predict nucleosome
positioning using models based on neural networks [5]. However, it remains a chal-
lenging task to detect non-B-DNA structures and determine their function with ma-
chine learning algorithms due to the absence of experimentally confirmed genome-

*
    Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribu-
     tion 4.0 International (CC BY 4.0)
2

wide data on many types of structures. As a result, patterns of DNA structures and
their positioning with respect to other elements are hard to detect as well. Despite the
limitation, non-B-DNA structures might influence chromatin reorganization by gov-
erning nucleosome positioning, thus, regulating transcription that makes pattern
recognition of DNA structures and nucleosomes positioning an important task.
   A model of right-handed double helix DNA molecule known as B-DNA was first
proposed in [6]. Although B-DNA conformation is widely spread and is considered as
canonical, more than ten types of DNA secondary structures are discovered: A-DNA,
Z-DNA, H-DHA, V-DNA, stem-loops, G-quadruplexes, i-motif, buldge-DNA, etc
[7]. Due to the scarcity of experimental data, the research is focused on three types of
DNA structures: Z-DNA, H-DNA, and G-quadruplexes.
   To begin with, left-handed double helix called Z-DNA is among the most studied
DNA conformation and found both in vitro [8] and in vivo [9]. Z-DNA has the poten-
tial to be involved in transcription. It was shown [9] that three regions near the pro-
moter of gene C-MYC have adopted Z-DNA conformation while the gene was active-
ly transcribed. As for the in silico detecton, Z-Hunt algorithm is mainly used to de-
scribe genomic region’s potential to form a left-handed helix.
   The next structure of interest is H-DNA. Triple helix consists of a usual double he-
lix, which is connected to separate single DNA strand either from its part or from
another molecule [10]. The existence in vitro [11] was confirmed a short while after
Watson and Crick discovery while in vivo proofs were found later [12]. Researches
point out H-DNA involvement in replication, transcription reparation and homolo-
gous recombination [13]. For example, the study implies H-DNA acts as a barrier for
replication [14]. The detection of H-DNA motives is based on the primary sequence
content: algorithms search for inverted repeats.
   As for G-quadruplexes, they also exist both in vitro [15] and in vivo [16]. In silico
detection uses particular motif composition to classify a region as G-quadruplex
adopting [17]. The biological function of G-quadruplexes is actively investigated.
Recent studies highlight G-quadruplexes regulate transcription and replication [18]. In
contrast to the absence of genome-wide data on other secondary structures, a tech-
nique called G4-seq is developed which maps genomic regions in G-quadruplex con-
formation.
   DNA molecules are compactly packed in the cells and are organized into chroma-
tin. Double-stranded DNA wraps around histone proteins forming nucleosomes [19].
When a nucleosome is formed, the underlying DNA region is inactive, it cannot be
transcribed as transcription factors cannot bind DNA. Nucleosomes positioning is
influenced by many factors: DNA sequence itself, histone modifications, remodeling
complexes, transcription, replication.
   As stated above, secondary DNA structures may influence transcription in cells by
regulating nucleosome positioning. It was experimentally confirmed the only B-DNA
can wrap around nucleosome which makes impossible for non-B-DNAs to bind his-
tones [20]. There is an evidence that Z-DNA and H-DNA govern nucleosome posi-
tioning acting as a barrier [21] while G-quadruplexes are formed in nucleosome-free
regions [22].
                                                                                       3

   Machine learning methods are applied to determine nucleosomal profiles. The
availability of genome-wide data on nucleosome positioning has led to the develop-
ment of different models. The first papers in this field use simple techniques of statis-
tical analysis to calculate the probability of nucleosome formation [23] with perfor-
mance quality of 50%. Later studies describe nucleosome positioning as machine
learning classification task and apply SVM and Random Forest algorithms, achieving
the prediction power more than 80%. Recent studies focus on convolutional neural
networks [5], which achieve accuracy, precision, and recall of more than 90%.
   Classical machine learning approach supposes the sample to be described with fea-
tures. The DNA sequence consists of only 4 units known as bases: A, C, G, T. Due to
the specific nature of a sample in genomic studies, a region of DNA sequence is con-
sidered as a string and methods of feature extraction similar to a text analysis are ap-
plied. One of the simplest and effective strategies is to examine k-tuple nucleotide
composition where k usually varies from 1 to 6. Another powerful approach is to
describe DNA sequence with physical and chemical properties of base pairs which are
presented in the publicly available databases. K-neighbors characteristics are used as
features as well.
   The literature analysis reveals several facts which are important for this paper.
First, high-throughput techniques are developed only for G-quadruplexes, so data is
available only for this type of structure, for other types of DNA structures computa-
tional methods are needed. Second, non-B-DNA structures are involved in the main
cell processes like transcription and replication. What is more important, they prevent
forming nucleosomes. Third, machine learning methods are used to define nucleoso-
mal profiles.
   To our knowledge, machine learning algorithms are trained to detect either nucleo-
somes or secondary DNA structures. This paper aims to recognize patterns of DNA
structures and nucleosomes positioning. The results may lead to better understanding
of chromatin remodeling mechanisms and how transcription is regulated by non-B-
DNA.


2      Methods

2.1    Genome Computational Annotation

To analyze patterns of DNA secondary structures and nucleosome positioning the
data on mouse genome is used. The mm9 version of genome is available at UCSC
Genome Browser [24].The genome is annotated with three types of structures: Z-
DNA, H-DNA and Q-quadruplexes. The table below represents the software used for
each type of structures.

                                        Table 1.

            Structure             Software
            Z-DNA                 Z-Hunt [25]
4


            H-DNA                Inverted Repeats Finder [26]
            Q-quadruplexes       QuadParser [17]


2.2    Genome In Vivo Annotation

In vivo detection of secondary structures is a challenging task. However, the study
presents [27] the design of ss-DNA-seq experiment on mouse B-cells to obtain ge-
nome-wide locations of DNA secondary structures. The data is available at Laborato-
ry’s Research Page [28]. The reads are aligned to mouse genome with Bowtie soft-
ware, version 0.12.7 [29].
   Then both computational annotations and in vivo detected structures are intersected
to define non-putative motives of DNA secondary structures. The intersection is done
with Bedtools [30] software of version 2.27.1.


2.3    Nucleosome Data

The nucleosome positioning profile is the result of MNAse-seq on mouse B-cells data
analysis provided by the study [27]. All the details are described in the paper’s meth-
ods while the data is available at NCBI under the SRA identifier SRA072844. The
data is preprocessed according to Illumina Analysis pipeline. The reads are aligned to
the mouse genome with Bowtie software, version 0.12.7.
   After the alignment each read is lengthened up to 146 base pairs in 3’ direction and
is considered as a nucleosome forming region in the particular cell line.


2.4    Patterns of DNA structures and nucleosomes

The region of interest is a sequence of 500 bp length centered on the secondary struc-
ture. For that region the coverage with MNAse preprocessed data is calculated to
discover the coverage density. The average coverage of the genome is computed
based on randomly selected 200 000 regions. Any region of interest, which is covered
by more than the average coverage, is further inspected for the type of pattern. The
average coverage is compared with t-test. Regions, which fail the test, are considered
as nucleosome-free (pattern 0).
   The region of interest is split into three parts: the center (DNA secondary struc-
ture), the right side (250 bp), the left side (250 pb). The maximum coverages within
each part are compared with each other. If all of them have close values, then the
region is classified as nucleosome-free (pattern 0). If the peaks on both right and left
sides are higher than that in the center, then the structure is surrounded by two nucle-
osomes (pattern 1). Following the same procedure, the pattern with a nucleosome on
one side is defined (pattern 2). For the simplicity reason pattern 1 and pattern 2 are
merged into one category.
                                                                                         5

2.5    Machine Learning Task

Let x be the sample representing a region of interest – a sequence of 500 bp length
centered on a secondary structure. Let y be the pattern which the region is associated
with and let y be considered as the class of the sample. The aim is to train a classifier
which can predict the pattern of any particular region. In other words, the model de-
fines the type of pattern of the region.


2.6    Feature extraction

It is a common problem to express a genomic region via feature vectors which can be
handled by classical machine learning algorithms. As for the task in this paper, the
sample represents a string of length of around 530 letters as the genome sequence
consists of 4 elements: A, C, G, T. K-tuple nucleotide composition with k equal to 2
and 3 is used as the feature extraction strategy. In other words, each sequence is de-
scribed with 80 features: 16 for the quantity of a particular dinucleotide and 64 for
the triplets. One feature as GC-content is added to the dataset.


2.7    Machine Learning Algorithms

Two algorithms are used for the classification task: Random Forest [31] and XGBoost
[32]. For both algorithms the following is true: the dataset is split into the training set
and the test set in the proportion of 70-30% . The training set is used to validate algo-
rithms with the 5-fold cross validation strategy. Different parameters are tested with
randomized search strategy.


Random Forest Classifier
   Algorithm available in the scikit-learn library [33] of version 0.20.01 was used in
this study for the classification task. To find the best model, the number of trees is
varied from 10 to 100.


XGBoost
  Open-source library [32] for Python is used with parameters varied:
     • 𝜆 from 0 to 1 with the step 0.1
     • 𝜃 from 0 to 1 with the step 0.1
     • 𝜂 from 0 to 0.5 with the step 0.25


2.8    Model evaluation

A set of quality measures are used to evaluate models:

• Accuracy
• Recall
6

• Precision
• ROC-AUC

   During the algorithms’ optimization ROC-AUC score was used as the scoring
function.


3      Results and Discussion

The investigation of the role of DNA secondary structures on nucleosome positioning
and pattern search requires data on structures and nucleosome maps. The computa-
tional annotations of the mouse genome with DNA secondary structures were com-
bined with ssDNA-seq data on B-cells. As it can be seen from table, it results in many
putative motives of non-B-DNA (Table 2). The results are not surprising because the
formation of alternative structure requires a set of conditions, wereas some genomic
regions can be tightly packed.

Table 2. Results of computer annotations of the mouse genome with DNA secondary structures
                           and enrichment with ssDNA-seq data.

Heading level          Z-DNA                 H-DNA                   G-quadruplexes
Nu of computer
                       249 752               320 585                 263 167
predicted structures
Nu of predicted
structures enriched    25 062 (10%)          17 109 (5.3%)           20 259 (7.7%)
in ssDNA

Then the motives enriched with ssDNA-seq data are determined that are used to find
patterns of association with nucleosomes. The analysis of 500 bp regions centered on
a DNA secondary structure determined based on MNAse-seq data together with ssD-
NA-seq reveals three types of patterns:
     1) The region is nucleosome free (pattern 0)
     2) The structure is surrounded by one side with nucleosome (pattern 1)
     3) The structure is surrounded by both sides (pattern 2)
   The patterns are illustrated on fig.1.


A                                B                               C
                                                                                                7

 Fig. 1. Three types of patterns: A) nucleosome-free B) structure surrounded with one nucleo-
                       some C) structure surrounded by two nucleosomes

   The most important observation is that nucleosome is never located on a structure
in actively transcribed cells. The biological hypothesis is that the pattern 1 and the
pattern 2 are involved in regulation processes. Secondary structures may act as barri-
ers preventing nucleosome formation or blocking nucleosome movement. Evidences
of that kind of behavior are reported in the literature. For example, the chromatin
remodeling complex freed DNA from histone proteins and left the DNA in the condi-
tion which favor Z-DNA structure [9].
   For the reason of simplicity, the pattern 1 and the pattern 2 are united into one
which is further designated as the class 1, while nucleosome free regions are denoted
as class 0.
   The distribution of classes among different types of structures is shown in table 3.


                               Table 3. Distribution of Classes

Heading level          Z-DNA                   H-DNA                   G-quadruplexes
Nucleosome Free
                       12 889 (51%)            9 816 (57%)             13 173(65%)
Region (Class 0)
Nucleosome From
One Side or From       12 173 (49%)            7 293 (43%)              7 086(35%)
Two Sides (Class 1)

   The aim of a machine learning application for this task is to distinguish genomic
regions with DNA secondary structures with regulation pattern from non-regulative
structures. For this purpose classifiers are trained for each type of structures. The
results are presented in table 4.

                   Table 4. Results of Machine Learning Models Training

Algorithm          Measure            Z-DNA              H-DNA              G-quadruplex
                   ROC-AUC            0.67               0.81               0.87
Random Forest      Accuracy           0.67               0.82               0.88
                   Recall             0.76               0.9                0.93
                   Precision          0.64               0.81               0.89
                   ROC-AUC            0.67               0.81               0.86
                   Accuracy           0.66               0.82               0.88
XGBoost
                   Recall             0.79               0.88               0.93
                   Precision          0.62               0.81               0.89
   To begin with, both algorithms show almost the same results. Moreover, models
for G-quaruplexes and H-DNA show good performance with prediction quality higher
than 80%. This corresponds to the results of researches which aim to predict nucleo-
8

some positions., and the best results are demonstrated by the models based on neural
network models. In addition, the poorest quality are demonstrated by the classifier
which distinguishes Z-DNA regulatory pattern. The possible reason is the feature set
used for these models. It consists of 2-tuple and 3-tuple nucleotide compositions. G-
quadruplexes and H-DNA are formed in specific sequences, so it is natural to expect
they are well predicted based on sequence content, while Z-DNA has more complex
formation preferences.
   Nevertheless, all the constructed models are significantly better than a random
guessing leading to the idea that more complicated models may result in a better clas-
sification.


4      Conclusion

DNA exists in many forms. Non-B-DNA conformations may be involved in main
molecular processes such as transcription and replication. One of the mechanisms is
the governance of nucleosome positioning. To evaluate the existence of positional
relationship between DNA structures and nucleosomes the data on nucleosome and
DNA structure maps were combined and then machine learning models were trained
to predict the patterns for a genomic region. Both Random Forest classifier and
XGBoost classifier showed good performance on G-quaruplexes and H-DNA while
the quality of the model for Z-DNA is not high.
   The practical applications of the obtained results could arise from the abilities of
non-B DNA structures serve as targets for drugs, and in this respect it is important to
understand the extent of the distribution of patterns involving DNA secondary struc-
tures across the entire genome. Thus, controlling the formation of non-B DNA struc-
tures may promote or inhibit production of harmful proteins including oncoproteins.
   Using G-quadruplexes as targets for drugs is widely discussed in literature [34-36].
Specifically, many quadruplexes are found in promoters of oncogenes, and targeting
quadruplexes by small ligands is considered as potential anticancer therapy [34]. Al-
so, G-quadruplexes are found in regulatory regions of viral genomes, and it opens a
possibility to use them as targets in antiviral therapy. Z-DNA is also found in genomic
regulatory regions and there are proteins that bind specifically Z-DNA [37]. Increased
transcription of some oncogenes was associated with Z-DNA formation. H-DNA, or
triplex DNA, is a form where RNA binds directly to double-stranded DNA. The regu-
latory potential of RNA is huge, and therapeutic potential is also high including anti-
cancer therapy [38]. Overall, all the classes of non-B DNA structures can potentially
be used in biomedical applications, and developing computational approaches could
help in the design of experiments.


References
 1. Libbrecht, M., Noble, W.: Machine learning applications in genetics and genomics. Nature
    Reviews Genetics. 16, 321-332 (2015).
                                                                                             9

 2. Degroeve, S., De Baets, B., Van de Peer, Y., Rouze, P.: Feature subset selection for splice
    site prediction. Bioinformatics. 18, S75-S83 (2002).
 3. Barash, Y., Calarco, J., Gao, W., Pan, Q., Wang, X., Shai, O., Blencowe, B., Frey, B.: De-
    ciphering the splicing code. Nature. 465, 53-59 (2010).
 4. Heintzman, N., Stuart, R., Hon, G., Fu, Y., Ching, C., Hawkins, R., Barrera, L., Van Cal-
    car, S., Qu, C., Ching, K., Wang, W., Weng, Z., Green, R., Crawford, G., Ren, B.: Distinct
    and predictive chromatin signatures of transcriptional promoters and enhancers in the hu-
    man genome. Nature Genetics. 39, 311-318 (2007).
 5. Zhang, J., Peng, W., Wang, L.: LeNup: learning nucleosome positioning from DNA se-
    quences with improved convolutional neural networks. Bioinformatics. 34, 1705-1712
    (2018).
 6. Svozil, D., Kalina, J., Omelka, M., Schneider, B.: DNA conformations and their sequence
    preferences. Nucleic Acids Research. 36, 3690-3706 (2008).
 7. Widom, J.: The Genomic Code for Nucleosome Positioning. Biophysical Journal. 98, 608a
    (2010).
 8. Wang, A., Quigley, G., Kolpak, F., Crawford, J., van Boom, J., van der Marel, G., Rich,
    A.: Molecular structure of a left-handed double helical DNA fragment at atomic resolution.
    Nature. 282, 680-686 (1979).
 9. Rich, A., Zhang, S.: Z-DNA: the long road to biological function. Nature Reviews Genet-
    ics. 4, 566-572 (2003).
10. Frank-Kamenetskii, M., Mirkin, S.: Triplex DNA Structures. Annual Review of Biochem-
    istry. 64, 65-95 (1995).
11. Felsenfeld, G., Davies, D., Rich, A.: Formation of a Three-Stranded Polynucleotide Mole-
    cule. Journal of the American Chemical Society. 79, 2023-2024 (1957).
12. Zain, R., Sun, J.: Do natural DNA triple-helical structures occur and function in vivo?.
    Cellular and Molecular Life Sciences. 60, 862-870 (2003).
13. Jain, A., Wang, G., Vasquez, K.: DNA triple helices: Biological consequences and thera-
    peutic potential. Biochimie. 90, 1117-1130 (2008).
14. Hoyne, P., Maher, L.: Functional Studies of Potential Intrastrand Triplex Elements in the
    Escherichia coli Genome. Journal of Molecular Biology. 318, 373-386 (2002).
15. Gellert, M., Lipsett, M., Davies, D.: Helix Formation by Guanilic Acid. Proceedings of the
    National Academy of Sciences. 48, 2013-2018 (1962).
16. Zhao, J., Bacolla, A., Wang, G., Vasquez, K.: Non-B DNA structure-induced genetic in-
    stability and evolution. Cellular and Molecular Life Sciences. 67, 43-62 (2009).
17. Huppert, J.: Prevalence of quadruplexes in the human genome. Nucleic Acids Research.
    33, 2908-2916 (2005).
18. Hänsel-Hertsch, R., Di Antonio, M., Balasubramanian, S.: DNA G-quadruplexes in the
    human genome: detection, functions and therapeutic potential. Nature Reviews Molecular
    Cell Biology. 18, 279-284 (2017).
19. Epstein, R.: Human molecular biology. Cambridge University Press, Cambridge (2003).
20. Garner, M., Felsenfeld, G.: Effect of Z-DNA on nucleosome placement. Journal of Molec-
    ular Biology. 196, 581-590 (1987).
21. Westin, L., Blomquist, P., Milligan, J., Wrange, Ö.: Triple helix DNA alters nucleosomal
    histone-DNA interactions and acts as a nucleosome barrier. Nucleic Acids Research. 23,
    2184-2191 (1995).
22. Hänsel-Hertsch, R., Beraldi, D., Lensing, S., Marsico, G., Zyner, K., Parry, A., Di Anto-
    nio, M., Pike, J., Kimura, H., Narita, M., Tannahill, D., Balasubramanian, S.: G-
    quadruplex structures mark human regulatory chromatin. Nature Genetics. 48, 1267-1272
    (2016).
10

23. Widom, J.: The Genomic Code for Nucleosome Positioning. Biophysical Journal. 98, 608a
    (2010).
24. UCSC Genome Browser Downloads, http://hgdownload.cse.ucsc.edu/downloads.html.
25. Champ, P., Maurice, S., Vargason, J., Camp, T., Ho, P.: Distributions of Z-DNA and nu-
    clear factor I in human chromosome 22: a model for coupled transcriptional regulation.
    Nucleic Acids Research. 32, 6501-6510 (2004).
26. Inverted Repeats Finder Download Page, http://tandem.bu.edu/irf/irf.download.html.
27. Kouzine, F., Wojtowicz, D., Baranello, L., Yamane, A., Nelson, S., Resch, W., Kieffer-
    Kwon, K., Benham, C., Casellas, R., Przytycka, T., Levens, D.: Permanganate/S1 Nucle-
    ase Footprinting Reveals Non-B DNA Structures with Regulatory Potential across a
    Mammalian Genome. Cell Systems. 4, 344-356.e7 (2017).
28. Teresa                       Przytycka                    Research                  Page,
    https://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/index.cgi#nonbdna.
29. Bowtie, http://bowtie-bio.sourceforge.net/index.shtml.
30. bedtools: a powerful toolset for genome arithmetic — bedtools 2.27.0 documentation,
    https://bedtools.readthedocs.io/en/latest.
31. Breiman, L.: Random Forests. Machine Learning. 45, 5-32 (2001).
32. XGBoost            Documentation         —         xgboost       0.81      documentation,
    https://xgboost.readthedocs.io/en/latest.
33. scikit-learn: machine learning in Python — scikit-learn 0.20.3 documentation.
    https://scikit-learn.org/0.20.
34. Duchler, M.: G-quadruplexes: targets and tools in anticancer drug design. J Drug Target,
    20, (5), 389-400 (2012).
35. Hurley, L.H., Wheelhouse, R.T., Sun, D., Kerwin, S.M., Salazar, M., Fedoroff, O.Y., Han,
    F.X., Han, H., Izbicka, E., and Von Hoff, D.D.: G-quadruplexes as targets for drug design.
    Pharmacol Ther. 85, (3), 141-158 (2000).
36. Ruggiero, E., and Richter, S.N.: G-quadruplexes and G-quadruplex ligands: targets and
    tools in antiviral therapy. Nucleic Acids Res. 46, (7), 3270-3283 (2018).
37. Shin, S.I., Ham, S., Park, J., Seo, S.H., Lim, C.H., Jeon, H., Huh, J., and Roh, T.Y.: Z-
    DNA-forming sites identified by ChIP-Seq are associated with actively transcribed regions
    in the human genome. DNA Res. (2016)
38. Jain, A., Wang, G., and Vasquez, K.M.: DNA triple helices: biological consequences and
    therapeutic potential. Biochimie. 90, (8), 1117-1130 (2008).