<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Machine Learning Applications for Genomic Pattern Recognition Problem</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elen Tevanyan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Poptsova</string-name>
          <email>mpoptsova@hse.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics</institution>
          ,
          <addr-line>Myasnitskaya str. 20, 101000, Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>DNA secondary structures are important functional elements that may influence cellular processes. One of their possible functions is the regulation of nucleosome positioning. Here, MNase-seq and ssDNA-seq data were used to define patterns of positional relationships between nucleosomes and DNA structures such as Z-DNA, H-DNA and G-quadruplexes. Three types of patterns were found: a structure flanked by nucleosomes on both sides, a structure flanked by a nucleosome on one side, and a structure in a nucleosome-free region. Machine learning models based on the Random Forest and XGBoost algorithms were trained to recognize 500 bp DNA regions containing a nucleosome positioning pattern for the three types of DNA structures (Z-DNA, H-DNA and G-quadruplexes), using compositional properties of the DNA sequence. The best performance (more than 86% for ROC-AUC, accuracy, recall and precision) was reached for G-quadruplexes. The 500 bp regions containing G-quadruplexes have distinct compositional properties and point to preferential locations of the defined patterns, whose regulatory functions require further investigation. For the other DNA structures, region composition is a less powerful predictive factor, and other physical and structural DNA properties should be taken into account to improve the recognition of nucleosome-DNA structure patterns.</p>
      </abstract>
      <kwd-group>
        <kwd>DNA structures</kwd>
        <kwd>nucleosome positioning</kwd>
        <kwd>machine-learning methods</kwd>
        <kwd>random forest</kwd>
        <kwd>xgboost</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Machine learning is widely applied to problems in genomic research. Computational
methods successfully annotate genomes with functional elements such as
transcription start sites [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], splice sites [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], alternative splicing events [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], promoters, and enhancers [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Recent advances in computational performance make it possible to predict nucleosome
positioning using models based on neural networks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, detecting non-B DNA structures and determining their function with
machine learning algorithms remains challenging, because experimentally confirmed
genome-wide data are lacking for many types of structures. As a result, patterns of DNA structures and
their positioning with respect to other elements are hard to detect as well. Despite this
limitation, non-B DNA structures might influence chromatin reorganization by
governing nucleosome positioning and thereby regulate transcription, which makes
recognizing patterns of DNA structures and nucleosome positioning an important task.
      </p>
      <p>
        A model of the right-handed double-helical DNA molecule, known as B-DNA, was first
proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Although the B-DNA conformation is widespread and is considered
canonical, more than ten types of DNA secondary structures have been discovered: A-DNA,
Z-DNA, H-DNA, V-DNA, stem-loops, G-quadruplexes, i-motifs, bulge DNA, etc.
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Because experimental data are scarce, this study focuses on three types of
DNA structures: Z-DNA, H-DNA, and G-quadruplexes.
      </p>
      <p>
        The left-handed double helix called Z-DNA is among the most studied
DNA conformations and has been found both in vitro [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and in vivo [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Z-DNA has the
potential to be involved in transcription. It was shown [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] that three regions near the
promoter of the C-MYC gene adopted the Z-DNA conformation while the gene was
actively transcribed. For in silico detection, the Z-Hunt algorithm is mainly used to
estimate the potential of a genomic region to form a left-handed helix.
      </p>
      <p>
        The next structure of interest is H-DNA. This triple helix consists of an ordinary double
helix bound to a separate single DNA strand, which comes either from the same molecule or from
another molecule [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Its existence in vitro [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] was confirmed shortly after
Watson and Crick's discovery, while in vivo evidence was found later [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Studies
point to the involvement of H-DNA in replication, transcription, repair, and
homologous recombination [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. For example, one study suggests that H-DNA acts as a barrier to
replication [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The detection of H-DNA motifs is based on the primary sequence
content: algorithms search for inverted repeats.
      </p>
      <p>
        G-quadruplexes also exist both in vitro [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and in vivo [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In silico
detection uses a particular motif composition to classify a region as capable of
adopting the G-quadruplex conformation [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The biological function of G-quadruplexes is under active investigation.
Recent studies indicate that G-quadruplexes regulate transcription and replication [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. In
contrast to the absence of genome-wide data on other secondary structures, a
technique called G4-seq has been developed that maps genomic regions in the G-quadruplex
conformation.
      </p>
      <p>
        DNA molecules are compactly packed in cells and are organized into
chromatin. Double-stranded DNA wraps around histone proteins, forming nucleosomes [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
When a nucleosome is formed, the underlying DNA region is inactive: it cannot be
transcribed because transcription factors cannot bind the DNA. Nucleosome positioning is
influenced by many factors: the DNA sequence itself, histone modifications, remodeling
complexes, transcription, and replication.
      </p>
      <p>
        As stated above, DNA secondary structures may influence transcription in cells by
regulating nucleosome positioning. It was experimentally confirmed that only B-DNA
can wrap around a nucleosome, which makes it impossible for non-B DNA to bind
histones [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. There is evidence that Z-DNA and H-DNA govern nucleosome
positioning by acting as barriers [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], while G-quadruplexes are formed in nucleosome-free
regions [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ].
      </p>
      <p>
        Machine learning methods are applied to determine nucleosomal profiles. The
availability of genome-wide data on nucleosome positioning has led to the
development of different models. The first papers in this field used simple techniques of
statistical analysis to calculate the probability of nucleosome formation [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], with
a prediction quality of 50%. Later studies framed nucleosome positioning as a machine
learning classification task and applied SVM and Random Forest algorithms, achieving
a prediction quality of more than 80%. Recent studies focus on convolutional neural
networks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which achieve accuracy, precision, and recall of more than 90%.
      </p>
      <p>The classical machine learning approach requires each sample to be described by
features. A DNA sequence consists of only four units, known as bases: A, C, G, and T. Because of
this specific nature of samples in genomic studies, a region of DNA sequence is
treated as a string, and feature extraction methods similar to those of text analysis are
applied. One of the simplest and most effective strategies is to examine the k-tuple nucleotide
composition, where k usually varies from 1 to 6. Another powerful approach is to
describe a DNA sequence by the physical and chemical properties of base pairs, which are
available in public databases. K-neighbor characteristics are used as
features as well.</p>
      <p>The literature analysis reveals several facts that are important for this paper.
First, high-throughput techniques have been developed only for G-quadruplexes, so experimental data are
available only for this type of structure; for other types of DNA structures,
computational methods are needed. Second, non-B DNA structures are involved in the main
cell processes, such as transcription and replication. More importantly, they prevent
nucleosome formation. Third, machine learning methods are used to define
nucleosomal profiles.</p>
      <p>To our knowledge, machine learning algorithms have so far been trained to detect either
nucleosomes or DNA secondary structures. This paper aims to recognize patterns of DNA
structures and nucleosome positioning. The results may lead to a better understanding
of chromatin remodeling mechanisms and of how transcription is regulated by
non-B DNA.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-1">
        <title>Genome Computational Annotation</title>
        <p>
          To analyze patterns of DNA secondary structures and nucleosome positioning,
data on the mouse genome are used. The mm9 version of the genome is available at the UCSC
Genome Browser [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. The genome is annotated with three types of structures:
Z-DNA, H-DNA and G-quadruplexes. The table below lists the software used for
each type of structure.
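        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Software used to annotate each type of structure.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Structure</th>
                <th>Software</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>Z-DNA</td>
                <td />
              </tr>
              <tr>
                <td>H-DNA</td>
                <td>Inverted Repeats Finder [<xref ref-type="bibr" rid="ref26">26</xref>]</td>
              </tr>
              <tr>
                <td>G-quadruplexes</td>
                <td>QuadParser [<xref ref-type="bibr" rid="ref17">17</xref>]</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>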
      </sec>
      <sec id="sec-2-2">
        <title>Genome In Vivo Annotation</title>
        <p>
          In vivo detection of secondary structures is a challenging task. However, study [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] presents the design of an ssDNA-seq experiment on mouse B-cells that yields
genome-wide locations of DNA secondary structures. The data are available at the
laboratory’s research page [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. The reads are aligned to the mouse genome with the Bowtie
software, version 0.12.7 [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ].
        </p>
        <p>
          The computational annotations are then intersected with the structures detected in vivo
to define non-putative (experimentally supported) motifs of DNA secondary structures. The intersection is done
with Bedtools [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], version 2.27.1.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Nucleosome Data</title>
        <p>
          The nucleosome positioning profile is obtained from the analysis of MNase-seq data on mouse B-cells
provided by study [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. All the details are described in the methods of that paper,
and the data are available at NCBI under the SRA identifier SRA072844. The
data are preprocessed according to the Illumina analysis pipeline. The reads are aligned to
the mouse genome with the Bowtie software, version 0.12.7.
        </p>
        <p>After the alignment, each read is extended to 146 base pairs in the 3’ direction and
is considered a nucleosome-forming region in the particular cell line.</p>
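        <p>The read extension step can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the <monospace>Read</monospace> type, the 0-based half-open coordinate convention and the function name are assumptions introduced here, and clipping at chromosome ends is handled only at position 0.</p>
        <preformat><![CDATA[
```python
from typing import NamedTuple

NUCLEOSOME_BP = 146  # footprint length stated in the text


class Read(NamedTuple):
    """Hypothetical aligned read; coordinates are 0-based, half-open."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def extend_read(read: Read, length: int = NUCLEOSOME_BP) -> Read:
    """Extend an aligned read to `length` bp in its 3' direction."""
    if read.strand == "+":
        # The 5' end is `start`, so the read grows towards higher coordinates.
        return read._replace(end=read.start + length)
    # Minus strand: the 5' end is `end`, so the read grows towards lower
    # coordinates and is clipped at the chromosome start.
    return read._replace(start=max(0, read.end - length))
```
]]></preformat>
        <p>For a 36 bp read on the plus strand starting at position 100, the sketch returns the interval from 100 to 246; on the minus strand, the interval grows in the opposite direction from the original end coordinate.</p>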
      </sec>
      <sec id="sec-2-4">
        <title>Patterns of DNA structures and nucleosomes</title>
        <p>The region of interest is a 500 bp sequence centered on the secondary
structure. For that region, the coverage by the preprocessed MNase-seq data is calculated to
determine the coverage density. The average coverage of the genome is computed
from 200 000 randomly selected regions. Any region of interest whose coverage
exceeds the genome average is further inspected for the type of pattern. The
coverage is compared with the average using a t-test. Regions that fail the test are considered
nucleosome-free (pattern 0).</p>
        <p>The region of interest is split into three parts: the center (the DNA secondary
structure), the right side (250 bp), and the left side (250 bp). The maximum coverage values within
the three parts are compared with each other. If all of them are close, the
region is classified as nucleosome-free (pattern 0). If the peaks on both the right and left
sides are higher than the peak in the center, the structure is surrounded by two
nucleosomes (pattern 1). Following the same procedure, the pattern with a nucleosome on
one side is defined (pattern 2). For simplicity, pattern 1 and pattern 2 are
merged into one category.</p>
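        <p>The peak comparison described above can be sketched as follows. The text does not quantify when peak values are close, so the relative <monospace>tolerance</monospace> parameter, the per-base coverage list and the function name are assumptions made for illustration only.</p>
        <preformat><![CDATA[
```python
from typing import Sequence

FLANK_BP = 250  # length of each flank, as stated in the text


def classify_pattern(coverage: Sequence[float], tolerance: float = 0.2) -> int:
    """Return 0 (nucleosome-free), 1 (two nucleosomes) or 2 (one nucleosome).

    `coverage` holds per-base coverage for the left flank, the central
    structure and the right flank, in that order.
    """
    if len(coverage) <= 2 * FLANK_BP:
        raise ValueError("region must contain two 250 bp flanks and a center")
    left = max(coverage[:FLANK_BP])
    center = max(coverage[FLANK_BP:-FLANK_BP])
    right = max(coverage[-FLANK_BP:])
    # Assumption: a flank peak counts as higher only if it exceeds the
    # central peak by more than the relative tolerance.
    threshold = center * (1.0 + tolerance)
    flanks_higher = int(left > threshold) + int(right > threshold)
    if flanks_higher == 2:
        return 1  # nucleosomes on both sides
    if flanks_higher == 1:
        return 2  # nucleosome on one side
    return 0      # all peaks close: nucleosome-free
```
]]></preformat>
        <p>Merging patterns 1 and 2 into a single class then amounts to mapping every non-zero return value to class 1.</p>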
      </sec>
      <sec id="sec-2-5">
        <title>Machine Learning Task</title>
        <p>Let x be a sample representing a region of interest, that is, a sequence of 500 bp
centered on a secondary structure. Let y be the pattern with which the region is
associated; y is treated as the class of the sample. The aim is to train a classifier
that predicts the pattern of any given region.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Feature extraction</title>
        <p>Expressing a genomic region as a feature vector that classical machine learning
algorithms can handle is a common problem. In this paper, a
sample is a string of around 530 letters over the four-letter alphabet of the genome
sequence: A, C, G, T. The k-tuple nucleotide composition with k equal to 2
and 3 is used as the feature extraction strategy. In other words, each sequence is
described by 80 features: 16 counts of dinucleotides and 64 counts of
trinucleotides. GC-content is added to the dataset as one more feature.</p>
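        <p>A minimal sketch of this feature extraction is given below. It assumes overlapping k-tuples counted with a sliding window and GC-content expressed as a fraction; the function name and the <monospace>gc_content</monospace> key are hypothetical, and the text does not state whether counts were normalized.</p>
        <preformat><![CDATA[
```python
from itertools import product
from typing import Dict

ALPHABET = "ACGT"


def kmer_features(sequence: str) -> Dict[str, float]:
    """Return 16 dinucleotide counts, 64 trinucleotide counts and GC-content."""
    seq = sequence.upper()
    features: Dict[str, float] = {}
    for k in (2, 3):
        # Start every possible k-tuple at zero so the vector layout is fixed.
        for kmer in product(ALPHABET, repeat=k):
            features["".join(kmer)] = 0.0
        # Slide a window of width k; tuples with other symbols (e.g. N) are skipped.
        for i in range(len(seq) - k + 1):
            kmer_str = seq[i:i + k]
            if kmer_str in features:
                features[kmer_str] += 1.0
    # GC-content is stored under its own key to avoid clashing with the
    # "GC" dinucleotide count.
    gc = seq.count("G") + seq.count("C")
    features["gc_content"] = gc / len(seq) if seq else 0.0
    return features
```
]]></preformat>
        <p>Under these assumptions every sequence maps to a vector of 16 + 64 + 1 = 81 values, regardless of its length.</p>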
      </sec>
      <sec id="sec-2-7">
        <title>Machine Learning Algorithms</title>
        <p>
          Two algorithms are used for the classification task: Random Forest [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] and XGBoost
[
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]. For both algorithms, the dataset is split into a training set
and a test set in a 70:30 proportion. The training set is used to validate the
algorithms with a 5-fold cross-validation strategy. Different parameters are tested with
a randomized search strategy.
        </p>
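        <p>The data partitioning can be illustrated with the sketch below, which works on sample indices only. It is not the implementation used in the study; the function names, the fixed random seed and the absence of class stratification are assumptions, and the randomized parameter search is not shown.</p>
        <preformat><![CDATA[
```python
import random
from typing import Iterator, List, Tuple


def train_test_split_indices(
    n: int, test_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Shuffle indices 0..n-1 and split them into training and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold(indices: List[int], k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (fit, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation
```
]]></preformat>
        <p>Only the training indices are passed to <monospace>k_fold</monospace>, so the test part stays untouched until the final evaluation.</p>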
      </sec>
      <sec id="sec-2-8">
        <title>Random Forest Classifier</title>
        <p>
          The implementation available in the scikit-learn library [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], version 0.20.01, is used in
this study for the classification task. To find the best model, the number of trees is
varied from 10 to 100.
        </p>
      </sec>
      <sec id="sec-2-9">
        <title>XGBoost</title>
        <p>
          Open-source library [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] for Python is used, with three parameters varied over the following ranges:
•  from 0 to 1 with a step of 0.1
•  from 0 to 1 with a step of 0.1
•  from 0 to 0.5 with a step of 0.25
        </p>
      </sec>
      <sec id="sec-2-10">
        <title>Model evaluation</title>
        <p>A set of quality measures is used to evaluate the models:
• Accuracy
• Recall
• Precision
• ROC-AUC</p>
        <p>During the optimization of the algorithms, the ROC-AUC score is used as the scoring
function.</p>
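        <p>For reference, the four measures can be computed as in the sketch below, which assumes binary labels (0 and 1) and, for ROC-AUC, real-valued classifier scores. The function names are hypothetical, and ROC-AUC is computed in its rank formulation: the fraction of positive-negative pairs in which the positive sample receives the higher score, with ties counted as one half.</p>
        <preformat><![CDATA[
```python
from typing import Sequence, Tuple


def accuracy_recall_precision(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (accuracy, recall, precision) for binary labels 0 and 1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label sequences must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, recall, precision


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs samples of both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```
]]></preformat>
        <p>The pairwise formulation is quadratic in the number of samples, which is acceptable for an illustration but would be replaced by a sort-based computation for large datasets.</p>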
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results and Discussion</title>
      <p>Investigating the role of DNA secondary structures in nucleosome positioning
and searching for patterns requires data on structures and nucleosome maps. The
computational annotations of the mouse genome with DNA secondary structures were
combined with ssDNA-seq data on B-cells. As Table 2 shows, many of the
non-B DNA motifs remain putative. This result is not surprising, because the
formation of an alternative structure requires a set of conditions, whereas some genomic
regions can be tightly packed.</p>
      <p>The motifs enriched with ssDNA-seq data are then used to find
patterns of association with nucleosomes. The analysis of 500 bp regions centered on
a DNA secondary structure, based on MNase-seq data together with
ssDNA-seq data, reveals three types of patterns:
1) The region is nucleosome-free (pattern 0)
2) The structure is flanked by a nucleosome on one side (pattern 2)
3) The structure is flanked by nucleosomes on both sides (pattern 1)</p>
      <p>The patterns are illustrated in Fig. 1.</p>
      <p>Fig. 1. Three types of patterns: A) nucleosome-free region, B) structure flanked by one
nucleosome, C) structure flanked by two nucleosomes.</p>
      <p>
        The most important observation is that a nucleosome is never located on a structure
in cells with active transcription. The biological hypothesis is that pattern 1 and
pattern 2 are involved in regulatory processes. Secondary structures may act as
barriers that prevent nucleosome formation or block nucleosome movement. Evidence
of this kind of behavior is reported in the literature. For example, a chromatin
remodeling complex freed DNA from histone proteins and left the DNA in a
condition that favors the Z-DNA structure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>For simplicity, pattern 1 and pattern 2 are merged into one class,
further designated as class 1, while nucleosome-free regions are denoted
as class 0.</p>
      <p>The distribution of classes among the different types of structures is shown in Table 3.</p>
      <p>The aim of applying machine learning to this task is to distinguish genomic
regions in which DNA secondary structures form a regulatory pattern from regions with
non-regulatory structures. For this purpose, a classifier is trained for each type of structure. The
results are presented in Table 4.</p>
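      <table-wrap id="tab4">
        <label>Table 4</label>
        <caption>
          <p>Performance of the classifiers for each type of DNA structure.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Algorithm</th>
              <th>Measure</th>
              <th>Z-DNA</th>
              <th>H-DNA</th>
              <th>G-quadruplex</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>Random Forest</td><td>ROC-AUC</td><td>0.67</td><td>0.81</td><td>0.87</td></tr>
            <tr><td>Random Forest</td><td>Accuracy</td><td>0.67</td><td>0.82</td><td>0.88</td></tr>
            <tr><td>Random Forest</td><td>Recall</td><td>0.76</td><td>0.9</td><td>0.93</td></tr>
            <tr><td>Random Forest</td><td>Precision</td><td>0.64</td><td>0.81</td><td>0.89</td></tr>
            <tr><td>XGBoost</td><td>ROC-AUC</td><td>0.67</td><td>0.81</td><td>0.86</td></tr>
            <tr><td>XGBoost</td><td>Accuracy</td><td>0.66</td><td>0.82</td><td>0.88</td></tr>
            <tr><td>XGBoost</td><td>Recall</td><td>0.79</td><td>0.88</td><td>0.93</td></tr>
            <tr><td>XGBoost</td><td>Precision</td><td>0.62</td><td>0.81</td><td>0.89</td></tr>
          </tbody>
        </table>
      </table-wrap>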
      <p>Both algorithms show almost the same results. Models
for G-quadruplexes and H-DNA perform well, with prediction quality higher
than 80%. This is in line with the results of studies that aim to predict
nucleosome positions, where the best results are demonstrated by models based on neural
networks. The poorest quality is demonstrated by the classifier
that distinguishes the Z-DNA regulatory pattern. A possible reason is the feature set
used for these models, which consists of 2-tuple and 3-tuple nucleotide compositions.
G-quadruplexes and H-DNA are formed in specific sequences, so it is natural to expect
that they are well predicted from sequence content, while Z-DNA has more complex
formation preferences.</p>
      <p>Nevertheless, all the constructed models perform significantly better than random
guessing, which suggests that more complex models may result in better
classification.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>DNA exists in many forms. Non-B DNA conformations may be involved in the main
molecular processes, such as transcription and replication. One of the mechanisms is
the governance of nucleosome positioning. To evaluate the existence of a positional
relationship between DNA structures and nucleosomes, data on nucleosome and
DNA structure maps were combined, and machine learning models were then trained
to predict the patterns for a genomic region. Both the Random Forest classifier and
the XGBoost classifier showed good performance on G-quadruplexes and H-DNA, while
the quality of the model for Z-DNA was not high.</p>
      <p>Practical applications of the obtained results could arise from the ability of
non-B DNA structures to serve as targets for drugs; in this respect, it is important to
understand how patterns involving DNA secondary
structures are distributed across the entire genome. Controlling the formation of non-B DNA
structures may thus promote or inhibit the production of harmful proteins, including oncoproteins.</p>
      <p>
        The use of G-quadruplexes as drug targets is widely discussed in the literature [
        <xref ref-type="bibr" rid="ref34 ref35 ref36">34-36</xref>
        ].
Specifically, many quadruplexes are found in the promoters of oncogenes, and targeting
quadruplexes with small ligands is considered a potential anticancer therapy [
        <xref ref-type="bibr" rid="ref34">34</xref>
        ].
G-quadruplexes are also found in regulatory regions of viral genomes, which opens the
possibility of using them as targets in antiviral therapy. Z-DNA is also found in genomic
regulatory regions, and there are proteins that specifically bind Z-DNA [
        <xref ref-type="bibr" rid="ref37">37</xref>
        ]. Increased
transcription of some oncogenes has been associated with Z-DNA formation. H-DNA is a
triplex form of DNA; a triplex can also form when RNA binds directly to double-stranded DNA. The
regulatory potential of RNA is large, and its therapeutic potential is also high, including
for anticancer therapy [
        <xref ref-type="bibr" rid="ref38">38</xref>
        ]. Overall, all the classes of non-B DNA structures can potentially
be used in biomedical applications, and developing computational approaches could
help in the design of experiments.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Libbrecht</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noble</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Machine learning applications in genetics and genomics</article-title>
          .
          <source>Nature Reviews Genetics</source>
          .
          <volume>16</volume>
          ,
          <fpage>321</fpage>
          -
          <lpage>332</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Degroeve</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Baets</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Peer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rouze</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Feature subset selection for splice site prediction</article-title>
          .
          <source>Bioinformatics</source>
          .
          <volume>18</volume>
          ,
          <fpage>S75</fpage>
          -
          <lpage>S83</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Barash</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calarco</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shai</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blencowe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frey</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Deciphering the splicing code</article-title>
          .
          <source>Nature</source>
          .
          <volume>465</volume>
          ,
          <fpage>53</fpage>
          -
          <lpage>59</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Heintzman</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stuart</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hon</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ching</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hawkins</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barrera</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Calcar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ching</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Crawford</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome</article-title>
          .
          <source>Nature Genetics</source>
          .
          <volume>39</volume>
          ,
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>LeNup: learning nucleosome positioning from DNA sequences with improved convolutional neural networks</article-title>
          .
          <source>Bioinformatics</source>
          .
          <volume>34</volume>
          ,
          <fpage>1705</fpage>
          -
          <lpage>1712</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Svozil</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalina</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Omelka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schneider</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>DNA conformations and their sequence preferences</article-title>
          .
          <source>Nucleic Acids Research</source>
          .
          <volume>36</volume>
          ,
          <fpage>3690</fpage>
          -
          <lpage>3706</lpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Widom</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The Genomic Code for Nucleosome Positioning</article-title>
          .
          <source>Biophysical Journal</source>
          .
          <volume>98</volume>
          ,
          <issue>608a</issue>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quigley</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kolpak</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
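          <string-name>
            <surname>Crawford</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Boom</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Marel</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,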
          <string-name>
            <surname>Rich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Molecular structure of a left-handed double helical DNA fragment at atomic resolution</article-title>
          .
          <source>Nature</source>
          .
          <volume>282</volume>
          ,
          <fpage>680</fpage>
          -
          <lpage>686</lpage>
          (
          <year>1979</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Rich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Z-DNA: the long road to biological function</article-title>
          .
          <source>Nature Reviews Genetics</source>
          .
          <volume>4</volume>
          ,
          <fpage>566</fpage>
          -
          <lpage>572</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Frank-Kamenetskii</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirkin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Triplex DNA Structures</article-title>
          .
          <source>Annual Review of Biochemistry</source>
          .
          <volume>64</volume>
          ,
          <fpage>65</fpage>
          -
          <lpage>95</lpage>
          (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Felsenfeld</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davies</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Formation of a Three-Stranded Polynucleotide Molecule</article-title>
          .
          <source>Journal of the American Chemical Society</source>
          .
          <volume>79</volume>
          ,
          <fpage>2023</fpage>
          -
          <lpage>2024</lpage>
          (
          <year>1957</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Zain</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Do natural DNA triple-helical structures occur and function in vivo?</article-title>
          .
          <source>Cellular and Molecular Life Sciences</source>
          .
          <volume>60</volume>
          ,
          <fpage>862</fpage>
          -
          <lpage>870</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasquez</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>DNA triple helices: Biological consequences and therapeutic potential</article-title>
          .
          <source>Biochimie</source>
          .
          <volume>90</volume>
          ,
          <fpage>1117</fpage>
          -
          <lpage>1130</lpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Hoyne</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maher</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Functional Studies of Potential Intrastrand Triplex Elements in the Escherichia coli Genome</article-title>
          .
          <source>Journal of Molecular Biology</source>
          .
          <volume>318</volume>
          ,
          <fpage>373</fpage>
          -
          <lpage>386</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Gellert</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipsett</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davies</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Helix Formation by Guanylic Acid</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          .
          <volume>48</volume>
          ,
          <fpage>2013</fpage>
          -
          <lpage>2018</lpage>
          (
          <year>1962</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bacolla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasquez</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Non-B DNA structure-induced genetic instability and evolution</article-title>
          .
          <source>Cellular and Molecular Life Sciences</source>
          .
          <volume>67</volume>
          ,
          <fpage>43</fpage>
          -
          <lpage>62</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Huppert</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Prevalence of quadruplexes in the human genome</article-title>
          .
          <source>Nucleic Acids Research</source>
          .
          <volume>33</volume>
          ,
          <fpage>2908</fpage>
          -
          <lpage>2916</lpage>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Hänsel-Hertsch</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Antonio</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balasubramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>DNA G-quadruplexes in the human genome: detection, functions and therapeutic potential</article-title>
          .
          <source>Nature Reviews Molecular Cell Biology</source>
          .
          <volume>18</volume>
          ,
          <fpage>279</fpage>
          -
          <lpage>284</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Epstein</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Human molecular biology</article-title>
          . Cambridge University Press, Cambridge (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Garner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Felsenfeld</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Effect of Z-DNA on nucleosome placement</article-title>
          .
          <source>Journal of Molecular Biology</source>
          .
          <volume>196</volume>
          ,
          <fpage>581</fpage>
          -
          <lpage>590</lpage>
          (
          <year>1987</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Westin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blomquist</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milligan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wrange</surname>
            ,
            <given-names>Ö.</given-names>
          </string-name>
          :
          <article-title>Triple helix DNA alters nucleosomal histone-DNA interactions and acts as a nucleosome barrier</article-title>
          .
          <source>Nucleic Acids Research</source>
          .
          <volume>23</volume>
          ,
          <fpage>2184</fpage>
          -
          <lpage>2191</lpage>
          (
          <year>1995</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Hänsel-Hertsch</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beraldi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lensing</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marsico</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zyner</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parry</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Antonio</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pike</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kimura</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narita</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannahill</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Balasubramanian</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>G-quadruplex structures mark human regulatory chromatin</article-title>
          .
          <source>Nature Genetics</source>
          .
          <volume>48</volume>
          ,
          <fpage>1267</fpage>
          -
          <lpage>1272</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Widom</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The Genomic Code for Nucleosome Positioning</article-title>
          .
          <source>Biophysical Journal</source>
          .
          <volume>98</volume>
          ,
          <fpage>608a</fpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>24. UCSC Genome Browser Downloads, http://hgdownload.cse.ucsc.edu/downloads.html.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Champ</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maurice</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vargason</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Camp</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Distributions of Z-DNA and nuclear factor I in human chromosome 22: a model for coupled transcriptional regulation</article-title>
          .
          <source>Nucleic Acids Research</source>
          .
          <volume>32</volume>
          ,
          <fpage>6501</fpage>
          -
          <lpage>6510</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>26. Inverted Repeats Finder Download Page, http://tandem.bu.edu/irf/irf.download.html.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Kouzine</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wojtowicz</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baranello</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yamane</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nelson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Resch</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kieffer-Kwon</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benham</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casellas</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Przytycka</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levens</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Permanganate/S1 Nuclease Footprinting Reveals Non-B DNA Structures with Regulatory Potential across a Mammalian Genome</article-title>
          .
          <source>Cell Systems</source>
          .
          <volume>4</volume>
          ,
          <fpage>344</fpage>
          -
          <lpage>356</lpage>
          .e7
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>28. Teresa Przytycka Research Page, https://www.ncbi.nlm.nih.gov/CBBresearch/Przytycka/index.cgi#nonbdna.</mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <source>Bowtie</source>
          , http://bowtie-bio.sourceforge.net/index.shtml.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <source>bedtools: a powerful toolset for genome arithmetic - bedtools 2.27.0 documentation</source>
          , https://bedtools.readthedocs.io/en/latest.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Random Forests</article-title>
          .
          <source>Machine Learning</source>
          .
          <volume>45</volume>
          ,
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          (
          <year>2001</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32. XGBoost Documentation - xgboost 0.81 documentation, https://xgboost.readthedocs.io/en/latest.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <source>scikit-learn: machine learning in Python - scikit-learn 0.20.3 documentation</source>
          . https://scikit-learn.org/0.20.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Duchler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>G-quadruplexes: targets and tools in anticancer drug design</article-title>
          .
          <source>J Drug Target</source>
          ,
          <volume>20</volume>
          , (
          <issue>5</issue>
          ),
          <fpage>389</fpage>
          -
          <lpage>400</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Hurley</surname>
            ,
            <given-names>L.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wheelhouse</surname>
            ,
            <given-names>R.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kerwin</surname>
            ,
            <given-names>S.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salazar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fedoroff</surname>
            ,
            <given-names>O.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>F.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Izbicka</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Von Hoff</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          :
          <article-title>G-quadruplexes as targets for drug design</article-title>
          .
          <source>Pharmacol Ther</source>
          .
          <volume>85</volume>
          , (
          <issue>3</issue>
          ),
          <fpage>141</fpage>
          -
          <lpage>158</lpage>
          (
          <year>2000</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Ruggiero</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Richter</surname>
            ,
            <given-names>S.N.</given-names>
          </string-name>
          :
          <article-title>G-quadruplexes and G-quadruplex ligands: targets and tools in antiviral therapy</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <volume>46</volume>
          , (
          <issue>7</issue>
          ),
          <fpage>3270</fpage>
          -
          <lpage>3283</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Shin</surname>
            ,
            <given-names>S.I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ham</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seo</surname>
            ,
            <given-names>S.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jeon</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Roh</surname>
            ,
            <given-names>T.Y.</given-names>
          </string-name>
          :
          <article-title>Z-DNA-forming sites identified by ChIP-Seq are associated with actively transcribed regions in the human genome</article-title>
          .
          <source>DNA Res</source>
          .
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Jain</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Vasquez</surname>
            ,
            <given-names>K.M.</given-names>
          </string-name>
          :
          <article-title>DNA triple helices: biological consequences and therapeutic potential</article-title>
          .
          <source>Biochimie</source>
          .
          <volume>90</volume>
          , (
          <issue>8</issue>
          ),
          <fpage>1117</fpage>
          -
          <lpage>1130</lpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>