<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Deep Learning Approach to Recognize Elements Using Diverse Genomic Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nazar Beknazarov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seungmin Jin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Poptsova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Laboratory of Bioinformatics, Faculty of Computer Science, National Research University Higher School of Economics</institution>
          ,
          <addr-line>11 Pokrovsky Boulevard, Moscow</addr-line>
          ,
          <country country="RU">Russia 101000</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>The genome-sequencing revolution has generated large volumes of -omics data. Once a primary genomic sequence is obtained, the next major task is to study the genomic regulatory code. Epigenetic data sets provide a hint of how regulatory patterns are distributed across different tissues. Another layer of the genome regulatory code comprises DNA secondary structures, which can act as regulators of various genomic processes. With Big Data from next-generation sequencing experiments available, machine learning approaches were chosen to solve the task of recognizing genomic functional elements. Earlier attempts to annotate the genome with different classes of functional elements, i.e. nucleosomal DNA, exon-intron boundaries, and enhancers, used machine learning algorithms that required manual collection of the features needed to characterize genomic regions. Lately, deep learning approaches, including convolutional neural networks and recurrent neural networks, have become successful at recognizing genomic functional elements based on sequence information only and/or with additional information on epigenetics and known regulatory elements. Here we discuss the deep learning approach and provide an example of building a deep learning model for the task of recognizing DNA secondary structures.</p>
      </abstract>
      <kwd-group>
        <kwd>DNA secondary structures</kwd>
        <kwd>histone code</kwd>
        <kwd>histone marks</kwd>
        <kwd>epigenetics</kwd>
        <kwd>machine learning</kwd>
        <kwd>deep learning</kwd>
        <kwd>convolutional neural networks</kwd>
        <kwd>recurrent neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Deep learning is becoming popular and easy to apply to various tasks. Among deep learning
architectures, the CNN (Convolutional Neural Network) and the RNN (Recurrent Neural Network) are the
most popular, showing state-of-the-art performance in the majority of
applications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This is achieved by combining top performance in the spatial and temporal
dimensions. A CNN can capture hierarchical information in space. The mechanism of a CNN consists
essentially in exploring one region of the input at a time and mapping it to a specific feature space.
By generating a series of convolutions over each region, the network learns spatial features
hierarchically [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For instance, in the task of face recognition, a CNN starts by gathering convolutions
from lines or circles in face images, then filters these features to build up the feature maps
of the nose, eyes, and ears, and finally recognizes the face [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        An RNN can learn temporal order using its context and, additionally, being Turing-complete, it can
theoretically learn any kind of function [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Essentially, the RNN model keeps passing along a context vector,
which compresses the information at a given time step to predict the outcome at future time steps. This
means an RNN can handle input of arbitrary length [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This feature makes RNNs useful in many
sequential tasks, such as machine translation, time series prediction, speech recognition, and
signal processing.
      </p>
      <p>Modeling and Analysis of Complex Systems and Processes - MACSPro’2020, October 22–24, 2020, Venice, Italy &amp; Moscow, Russia. EMAIL: nazar.s.beknazarov@gmail.com (A. 1); mpoptsova@hse.ru (A. 3). ORCID: 0000-0002-7198-8234 (A. 3).</p>
      <p>© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>
        However, in practice an RNN does not work well alone, especially for feature
extraction and long-term prediction tasks [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. This is why combining a CNN and an RNN is a common
practice that shows the best results in deep learning tasks [
        <xref ref-type="bibr" rid="ref6 ref7">6-8</xref>
        ].
      </p>
      <p>
        In bioinformatics, research in deep learning has been increasing rapidly since the early 2000s, and
CNNs and RNNs are widely applied to various tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For example, CNNs have been applied to predicting gene
expression from epigenomic data, to anomaly classification in biomedical imaging, and to brain decoding in
biomedical signal processing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. RNNs have also been applied to protein structure classification and to
anomaly classification in biomedical signal processing. Although combining the two models shows good
performance in practice, there is a tendency to use them separately in bioinformatics tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. One
of the pioneering examples of a hybrid CNN and RNN model for predicting the function of a DNA sequence
was implemented and tested in DanQ [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Another hybrid CNN-RNN model was applied to the task of
predicting enhancers based on histone modification marks [8]. In this research, we continue testing the
deep learning approach of combining the two models to recognize genomic functional elements using diverse
genomic data.
      </p>
      <p>As the genomic functional element we chose Z-DNA, which belongs to the class of DNA secondary structures. The
role of DNA secondary structures in the regulation of genomic processes has been confirmed
experimentally for quadruplexes, cruciform structures, triplexes, and Z-DNA. Experiments on
whole-genome detection of Z-DNA regions are under development, and several experimental
datasets are currently available [9, 10]. Building and testing machine learning models that aggregate
information from experimental data is an urgent task, since computational methods are needed for
annotating genomes with functional elements. Here we tested several machine learning approaches,
including deep learning, to detect Z-DNA regions. We show that deep learning, and specifically the
hybrid CNN plus RNN models, achieved the best performance in the task of Z-DNA recognition.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Material and Methods</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Data on Z-DNA, epigenetics, RNA polymerase, and transcription factor binding sites</title>
      <p>The positions of Z-DNA are taken from the dataset of a ChIP-Seq experiment on the identification of
binding sites of the Zaa protein, which binds to the left-handed form of DNA [10]. To improve the
prediction quality, we added information on the epigenetic and regulatory code to the sequence. Histone
mark positions and DNase hypersensitivity sites, which mark regions of open chromatin, are
taken from the international consortium project Roadmap Epigenomics [11]. Information on the
binding sites of RNA polymerase and transcription factors is taken from the Encyclopedia of DNA
Elements (ENCODE) project [12]. In total, 1065 features were selected.</p>
      <p>A DNA subsequence with Z-DNA regions is considered the output vector. A binary value is
assigned to every nucleotide depending on whether it lies inside a Z-DNA region. We considered
subsequences of 5000 bp; thus, every output vector has a length of 5000.</p>
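      <p>The per-nucleotide labeling described above can be sketched as follows. This is a minimal illustration we added; the function name and the half-open interval coordinates are our assumptions, not the paper's actual code.</p>
      <preformat>
```python
def label_vector(window_start, window_len, zdna_intervals):
    """Per-nucleotide binary labels for one window: 1 where the position
    falls inside a Z-DNA interval, 0 elsewhere."""
    labels = [0] * window_len
    for (s, e) in zdna_intervals:  # half-open genomic intervals [s, e)
        lo = max(s - window_start, 0)
        hi = min(e - window_start, window_len)
        for i in range(lo, max(hi, 0)):
            labels[i] = 1
    return labels

# A 10-bp toy window with one Z-DNA interval at positions 2..4
print(label_vector(0, 10, [(2, 5)]))  # [0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
```
      </preformat>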
    </sec>
    <sec id="sec-4">
      <title>2.2. Construction of train and test datasets</title>
      <p>We encoded the human DNA sequence using the one-hot encoding method, in which a sequence is
transformed into a binary matrix of 4xL, where L is the length of the sequence and the 4 rows correspond to
the 4 nucleotides, TCAG. This matrix is filled with zeros and has a single one at the cell of the corresponding
nucleotide in each position. Epigenomic data and RNA polymerase and transcription factor
binding sites were added to the encoded DNA sequence. Finally, we created a set of matrices, one for every
chromosome, spanning the full length of its DNA sequence. The shape of the input matrix is 1069xL,
where 1065 comes from the additional features and 4 from the one-hot encoded DNA, and L is the length of
the sequence. To avoid any dependencies between Z-DNA sites and the borders of DNA
subsequences, the DNA is uniformly divided into subsequences of length 5000. We then split the
subsequences into train and test sets in a ratio of 4 to 1, preserving the proportion of
subsequences with Z-DNA in each set.</p>
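      <p>The encoding step above can be sketched in plain Python as follows. This is an illustration we added; the function names and the list-of-lists matrix layout are our assumptions.</p>
      <preformat>
```python
NUCLEOTIDES = "TCAG"  # row order used for the 4 one-hot rows

def one_hot(seq):
    """Encode a DNA string as a 4 x L binary matrix (one row per nucleotide)."""
    return [[1 if b == base else 0 for b in seq.upper()] for base in NUCLEOTIDES]

def stack_features(seq, feature_tracks):
    """Stack the one-hot DNA rows with per-position feature tracks
    (epigenetic marks, binding sites, ...) into a (4 + n_tracks) x L matrix."""
    matrix = one_hot(seq)
    for track in feature_tracks:
        assert len(track) == len(seq), "each track must cover every position"
        matrix.append(list(track))
    return matrix

m = stack_features("TACG", [[0, 1, 1, 0]])  # one toy epigenetic track
print(len(m), len(m[0]))  # 5 4
```
      </preformat>
      <p>With the 1065 additional tracks used in this work, the same stacking yields the 1069xL input matrix described above.</p>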
    </sec>
    <sec id="sec-5">
      <title>2.3. Machine learning models</title>
    </sec>
    <sec id="sec-6">
      <title>2.3.1. Baseline model</title>
      <p>To indicate the level of performance of the deep learning models, we prepared a boosting
classifier as a baseline. The term ‘boosting’ here means that the method converts weak learners into strong
learners. Basically, boosting is an ensemble method for improving the predictions of any given
learning algorithm. It consists of sequentially training simple models, where each
subsequent model corrects the errors of the previous one. Boosting is a well-known method in the
bioinformatics domain and generally shows good results in many classification tasks [13-15].</p>
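      <p>The residual-correction idea behind boosting can be sketched with decision stumps. This is a toy regression-style illustration we wrote, not the actual baseline, which in practice would use a library implementation.</p>
      <preformat>
```python
def _mean(vals):
    return sum(vals) / len(vals) if vals else 0.0

def fit_stump(xs, residuals):
    """Least-squares decision stump on one feature: returns
    (threshold, value left of threshold, value right of threshold)."""
    best = None
    for t in sorted(set(xs)):
        right = [r for x, r in zip(xs, residuals) if x >= t]
        left = [r for x, r in zip(xs, residuals) if not x >= t]
        lv, rv = _mean(left), _mean(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or best[0] > err:
            best = (err, t, lv, rv)
    return best[1], best[2], best[3]

def boost(xs, ys, n_rounds=30, lr=0.5):
    """Each stump is fitted to the residuals of the ensemble so far,
    so every new weak learner corrects the errors of the previous ones."""
    pred = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        t, lv, rv = fit_stump(xs, residuals)
        pred = [p + lr * (rv if x >= t else lv) for p, x in zip(pred, xs)]
    return pred

print(boost([0, 1, 2, 3], [0, 0, 1, 1]))  # approaches [0, 0, 1, 1]
```
      </preformat>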
    </sec>
    <sec id="sec-7">
      <title>2.3.2. Deep learning models</title>
      <p>DNA has patterns in the form of one-dimensional sequence motifs, which a CNN can capture very
well; on the other hand, DNA is a text, so an RNN can learn the context from it. Therefore, we
expect the best result when the two models, CNN and RNN, are combined. For a proper comparison, we
also trained an independent CNN alongside the CNN + RNN model.</p>
      <p>2.3.3. CNN</p>
      <p>We experimented with several hyperparameters for the CNN models. We considered different
sizes of kernels and strides because they may influence the result. The number of output kernels was
set to 1, and we use a softmax layer at the end. Thus, these models output a vector whose length matches the
input, with each nucleotide corresponding to a probability value from 0 to 1. For each nucleotide,
there are C boolean values, where C is the kernel size. Every boolean value indicates the presence of
Z-DNA at that very point. The average of these C values was used as the target for the outcome cell. Since
padding is absent, the number of outcomes of the model equals the number of averaged values.
That means each model predicts the average number of Z-DNA nucleotides occurring in a given
segment and assigns this number to the middle of the segment. Increasing the number of layers or the kernel size
increases model complexity but may give better results. The next set of models has more convolutional
layers with ReLU activation. In this case, the target variable is calculated in a slightly different way:
averaging is performed over the size of the last layer. The size and number of kernels in the first and
second layers were selected from a predefined set of values.</p>
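      <p>The target construction just described, averaging the C boolean labels under each kernel position with no padding, can be sketched as follows. The helper below is our illustration, not the paper's code.</p>
      <preformat>
```python
def window_targets(labels, kernel_size, stride=1):
    """Average the boolean Z-DNA labels under each kernel position.
    With no padding, the number of targets equals the number of valid
    kernel positions, each assigned to the center of its window."""
    targets = []
    pos = 0
    while len(labels) - kernel_size >= pos:
        window = labels[pos:pos + kernel_size]
        targets.append(sum(window) / kernel_size)
        pos += stride
    return targets

print(window_targets([0, 0, 1, 1, 0], 3))  # three targets: 1/3, 2/3, 2/3
```
      </preformat>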
      <p>
        This type of hybrid model was successfully implemented in DanQ [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The CNN extracts important
motifs while, simultaneously, the RNN learns the complex regulatory grammar between the motifs. It is
assumed that the motifs detected by the CNN layer also have recurrent dependencies. In
theory, such a network is able to recognize the succession of motifs on which the Z-DNA configuration
depends. The model architecture used for Z-DNA detection is shown in Fig. 1.
      </p>
      <p>There are several ways to use an RNN: one-to-one, one-to-many, many-to-one, and many-to-many
(Fig. 2). In this paper, we considered two approaches: many-to-many and many-to-one.</p>
    </sec>
    <sec id="sec-8">
      <title>2.3.5. Approach many-to-one</title>
      <p>In this case, the structure of the model is as follows. The first part of the model is one or several
CNN layers, and each column of the resulting output is transferred separately to the RNN network. In
our case, a multi-layer bidirectional LSTM was selected as the RNN. The number of layers in the
CNN and LSTM parts and the sizes of the kernels and hidden layers were then selected. At the
end and the beginning of the sequence, the RNN layer outputs two vectors that are associated with the
long-term LSTM memory cells. Two LSTM context vectors are included because this RNN model is
bidirectional. The vectors are then passed to the fully connected layer, which makes the prediction.
The target variable is a boolean value indicating the presence of Z-DNA in this region of the sequence.</p>
    </sec>
    <sec id="sec-9">
      <title>2.3.6. Approach many-to-many</title>
      <p>This architecture copies the previous one completely, except for one element. After the RNN layer,
the output of the long-term memory element is ignored and the short-term memory outputs of each
direction are aggregated. Each unit of the sequence then corresponds to two vectors, which are
passed to the fully connected layer, after which predictions are made for each part of the sequence. The
target variable in this case is calculated exactly as in the CNN case: each unit of the
sequence is mapped to the average over a certain region of the chain.</p>
    </sec>
    <sec id="sec-10">
      <title>3. Results</title>
      <p>Quantiles were calculated for the distribution of the random AUC using bootstrap sampling (Table 1).
One can see that the first model has rather low quality, indistinguishable from that of a random
choice. The best CNN model showed an AUC of 69% on the test set. Its architecture is as
follows: the first layer is a convolutional layer with 36 kernels of size 13, stride 2, and padding 6;
the second layer is a ReLU; the third layer is a convolutional layer with 2 kernels of size 13,
stride 2, and padding 6; the last layer is a sigmoid. The hybrid
CNN+RNN showed higher quality than the CNN model.</p>
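      <p>The random-AUC band referred to above can be estimated as sketched below. This is an illustration we added; the quantile choice and bootstrap sample size are illustrative assumptions, not the exact procedure of the paper.</p>
      <preformat>
```python
import random

def auc(labels, scores):
    """Rank-based AUC: the probability that a random positive example
    receives a higher score than a random negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def random_auc_quantiles(labels, n_boot=500, qs=(0.025, 0.975), seed=0):
    """Bootstrap the AUC distribution under random scores: a model is
    indistinguishable from chance if its AUC stays inside this band."""
    rng = random.Random(seed)
    aucs = sorted(auc(labels, [rng.random() for _ in labels])
                  for _ in range(n_boot))
    return [aucs[int(q * (n_boot - 1))] for q in qs]
```
      </preformat>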
      <p>The best model with the many-to-one approach showed an AUC of 86.5%. The architecture of the best
CNN+RNN model is as follows: the first layer is a convolutional layer with 64 kernels of size 13,
stride 4, and padding 6; the second layer is a ReLU. The output of the ReLU is fed to a
bidirectional LSTM layer with hidden size 64 and 2 layers. The hidden state of the LSTM goes to a dropout
layer with probability 0.7. The last fully connected layer has 2 neurons.</p>
      <p>The best model with the many-to-many approach showed an AUC of 80.5%. The first layer is a convolutional
layer with 36 kernels of size 25, stride 2, and padding 12; the second layer is a ReLU; the third layer is
a convolutional layer with 64 kernels of size 25, stride 2, and padding 12; the fourth layer is a ReLU.
The output of the ReLU is fed to a bidirectional LSTM layer with hidden size 64 and 2 layers. The hidden state
of the LSTM goes to a dropout layer with probability 0.7. The last fully connected layer has 2 neurons.</p>
    </sec>
    <sec id="sec-11">
      <title>4. Conclusions and Discussion</title>
      <p>The following conclusions can be drawn from the obtained results. Although the CNN model shows
higher performance than the baseline, it does not handle the sequential nature of the DNA sequence.
The baseline and CNN models perform much worse than a model that contains an RNN layer. The
maximum quality achievable on this dataset with this set of architectures does
not exceed an AUC of 86%, which indicates that the task can be solved using the available data.</p>
      <p>Here we presented the results of a deep learning approach to Z-DNA prediction, in particular a
hybrid of two well-known deep learning network architectures, CNN and RNN. This architecture
outperforms both models based only on a CNN and classical machine learning models such as gradient
boosting. As we expected, CNN + RNN shows better results than CNN alone because the RNN can capture the
sequential pattern using its context. We assume our approach can be applied to many other
bioinformatics tasks that require mapping spatial data to sequential output.</p>
      <p>One of the advantages of our approach is scalability: the system can be upgraded as more
epigenetic and regulatory data become available. Thus, the same type of model can be applied to the
recognition of quadruplexes or triplexes, as well as to patterns of association between DNA secondary
structures and the epigenetic code. We expect that the inclusion of omics data will improve the prediction
quality of the model. However, a large feature space has the drawback of increasing model training
time. It would be beneficial first to find a minimal feature set that achieves the desired
model quality and then to train the model with the reduced feature space. This will also help to find
scientifically important associations between the studied functional elements and epigenetic and/or regulatory
elements.</p>
      <p>Deep neural networks are capable of effectively processing aggregated information from different
levels of genome organization. At the present time, when next-generation sequencing experiments are
still too expensive, machine learning models for annotating genomes with functional genomic
elements are very important. For some species, next-generation sequencing experiments on the
epigenomic and regulatory code are not available at all. Finding de novo or imputing novel functional
elements with computational artificial intelligence systems would help researchers understand the
principles and mechanisms of genome functioning.</p>
    </sec>
    <sec id="sec-12">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Meireles</surname>
            ,
            <given-names>M.R.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Almeida</surname>
            ,
            <given-names>P.E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simoes</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          :
          <article-title>A comprehensive review for industrial applicability of artificial neural networks</article-title>
          .
          <source>IEEE Transactions on Industrial Electronics</source>
          <volume>50</volume>
          , (
          <year>2003</year>
          )
          <fpage>585</fpage>
          -
          <lpage>601</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haffner</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Object recognition with gradient-based learning</article-title>
          <source>In: Forsyth</source>
          ,
          <string-name>
            <given-names>D.A.</given-names>
            ,
            <surname>Mundy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.L.</given-names>
            ,
            <surname>Gesú</surname>
          </string-name>
          , V.d.,
          <string-name>
            <surname>Cipolla</surname>
            ,
            <given-names>R</given-names>
          </string-name>
          . (eds.)
          <article-title>Shape, contour and grouping in computer vision</article-title>
          , pp.
          <fpage>319</fpage>
          -
          <lpage>345</lpage>
          . Springer, Berlin, Heidelberg (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kittler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christmas</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>S.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hospedales</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>When face recognition meets with deep learning: an evaluation of convolutional neural networks for face recognition</article-title>
          .
          <source>Proceedings of the IEEE international conference on computer vision workshops</source>
          . (
          <year>2015</year>
          )
          <fpage>142</fpage>
          -
          <lpage>150</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep learning</article-title>
          . MIT press (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ), (
          <year>1997</year>
          )
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Video-based emotion recognition using CNN-RNN and C3D hybrid networks</article-title>
          .
          <source>Proceedings of the 18th ACM International Conference on Multimodal Interaction</source>
          (
          <year>2016</year>
          )
          <fpage>445</fpage>
          -
          <lpage>450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Cnn-rnn: A unified framework for multi-label image classification</article-title>
          .
          <source>Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          (
          <year>2016</year>
          )
          <fpage>2285</fpage>
          -
          <lpage>2294</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>