Deep Learning Approach to Recognize Genome Functional Elements Using Diverse Genomic Data

Nazar Beknazarov, Faculty of Computer Science, Laboratory of Bioinformatics, National Research University Higher School of Economics

11 Pokrovsky Boulevard, 101000 Moscow, Russia

Seungmin Jin, Faculty of Computer Science, Laboratory of Bioinformatics, National Research University Higher School of Economics

11 Pokrovsky Boulevard, 101000 Moscow, Russia

Maria Poptsova, Faculty of Computer Science, Laboratory of Bioinformatics, National Research University Higher School of Economics

11 Pokrovsky Boulevard, 101000 Moscow, Russia

Keywords: DNA secondary structures, histone code, histone marks, epigenetics, machine learning, deep learning, convolutional neural networks, recurrent neural networks

The revolution in genome sequencing has generated large amounts of -omics data. Once a primary genomic sequence is obtained, the next major task is to study the genomic regulatory code. Epigenetic data sets provide hints of how regulatory patterns are distributed in different tissues. Another layer of the genomic regulatory code comprises DNA secondary structures, which can act as regulators of various genomic processes. With Big Data available from next-generation sequencing experiments, machine learning approaches have been chosen to solve the task of recognizing genomic functional elements. Earlier attempts to annotate genomes with different classes of functional elements, e.g. nucleosomal DNA, exon-intron boundaries, and enhancers, used machine learning algorithms that required manual collection of the features needed to characterize genomic regions. More recently, deep learning approaches, including convolutional neural networks and recurrent neural networks, have become successful in recognizing genomic functional elements based on sequence information only and/or with additional information on epigenetics and known regulatory elements. Here we discuss a deep learning approach and provide an example of building a deep learning model for the task of recognizing DNA secondary structures.

Introduction

Deep learning is becoming popular and easy to apply to a variety of tasks. CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network) are the most popular deep learning architectures, and they may show state-of-the-art performance in the majority of applications [1]. This is achieved by combining top performance in the spatial and temporal dimensions. CNN may capture hierarchical information in space. Essentially, a CNN explores the input one region at a time and maps each region to a specific feature space. By generating a series of convolutions at each region, the network may learn spatial features hierarchically [2]. For instance, in face recognition a CNN starts by learning convolutions that respond to lines or circles in face images, then filters these features to build feature maps of the nose, eyes, and ears, and finally recognizes the face [3].

RNN can learn temporal order using its context and, being Turing-complete, may in theory learn any kind of function [4]. Essentially, an RNN model keeps passing along a context vector that compresses the information at a given time step in order to predict the outcome at future time steps. This means that an RNN may handle input of arbitrary length [5]. This feature makes RNN useful in many sequential tasks, such as machine translation, time series prediction, speech recognition, and signal processing. In practice, however, RNN does not work well alone, especially for feature extraction and long-term prediction tasks [4,5]. This is why combining CNN and RNN is a common practice that shows the best results in deep learning tasks [6][7][8].

In bioinformatics, research in deep learning has been increasing rapidly since the early 2000s, and CNN and RNN are widely applied to various tasks [6]. For example, CNN has been applied to the prediction of gene expression from epigenomic data, anomaly classification in biomedical imaging, and brain decoding in biomedical signal processing [6]. RNN has been applied to protein structure classification and to anomaly classification in biomedical signal processing. Although combining the two models shows good performance in practice, there is a tendency to use them separately in bioinformatics tasks [6]. One of the pioneering examples of a hybrid CNN and RNN model, built to predict the function of DNA sequences, was implemented and tested in DanQ [7]. Another hybrid CNN-RNN model was applied to the task of predicting enhancers based on histone modification marks [8]. In this research, we continue testing a deep learning approach that combines the two models to recognize genome functional elements using diverse genomic data.

As a genomic functional element we chose Z-DNA, which belongs to the class of DNA secondary structures. The role of DNA secondary structures in the regulation of genomic processes has been confirmed experimentally for quadruplexes, cruciform structures, triplexes, and Z-DNA. Experiments on whole-genome detection of Z-DNA regions are under development, and several experimental datasets are currently available [9,10]. Building and testing machine learning models that aggregate information from experimental data is an urgent task, since there is a need for computational methods of genome annotation with functional elements. Here we tested several machine learning approaches, including deep learning, to detect Z-DNA regions. We showed that deep learning, and specifically hybrid CNN plus RNN models, achieved the best performance in the task of Z-DNA recognition.

Materials and Methods

Data on Z-DNA, epigenetics, RNA polymerase, and transcription factor binding sites

The positions of Z-DNA were taken from the dataset of a ChIP-Seq experiment that identified binding sites of the Zaa protein, which binds to the left-handed form of DNA [10]. To improve the quality of sequence-based prediction, we added information on the epigenetic and regulatory code. Histone mark positions and DNase hypersensitivity sites, which mark regions of open chromatin, were taken from the international consortium project Roadmap Epigenomics [11]. Information on the binding sites of RNA polymerase and transcription factors was taken from the Encyclopedia of DNA Elements (ENCODE) project [12]. In total, 1065 features were selected.

The output vector for a DNA subsequence marks its Z-DNA regions: a binary value is assigned to every nucleotide depending on whether it lies inside a Z-DNA region. We considered subsequences of 5000 bp; thus, every output vector has a length of 5000.
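The per-nucleotide labeling described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `label_vector`, the half-open `(start, end)` interval convention, and the example coordinates are all assumptions made for the example.

```python
def label_vector(window_start, window_len, zdna_intervals):
    """Return a 0/1 list of length window_len; 1 marks a nucleotide inside Z-DNA.

    zdna_intervals holds (start, end) genomic coordinates, assumed half-open.
    """
    labels = [0] * window_len
    window_end = window_start + window_len
    for start, end in zdna_intervals:
        # Clip each interval to the window; an empty range means no overlap.
        lo = max(start, window_start)
        hi = min(end, window_end)
        for pos in range(lo, hi):
            labels[pos - window_start] = 1
    return labels


# Hypothetical peaks: one straddles the window border, one lies inside it.
peaks = [(4990, 5010), (7000, 7003)]
y = label_vector(window_start=5000, window_len=5000, zdna_intervals=peaks)
print(len(y), sum(y))
```

With these made-up peaks the window receives 13 positive positions: 10 from the clipped first peak and 3 from the second.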

Construction of train and test datasets

We encoded the human DNA sequence using the one-hot encoding method, in which a sequence is transformed into a binary matrix of size 4xL, where L is the length of the sequence and the 4 rows correspond to the 4 nucleotides, TCAG. The matrix is filled with zeros except for a single one in each column, in the row of the corresponding nucleotide. Epigenomic data and the binding sites of RNA polymerase and transcription factors were added to the encoded DNA sequence. Finally, we created a set of matrices, one for every chromosome, each with the same length as the chromosome's DNA sequence. The shape of the input matrix is 1069xL, where 1065 rows come from the additional features and 4 from the one-hot encoded DNA, and L is the length of the sequence. In order to avoid any dependencies between Z-DNA sites and the borders of DNA subsequences, the DNA was divided uniformly into subsequences of length 5000. We then split the subsequences into train and test sets in a ratio of 4 to 1, preserving the proportion of subsequences with Z-DNA in each set.
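The dataset construction above can be illustrated with a short pure-Python sketch. It is an illustration under stated assumptions rather than the authors' pipeline: the helper names (`one_hot`, `stack_features`, `windows`, `stratified_split`) are hypothetical, unknown bases such as N are assumed to stay all-zero, and a trailing partial window is assumed to be dropped, since the text does not say how these cases were handled.

```python
import random

NUCLEOTIDES = "TCAG"  # row order given in the text


def one_hot(seq):
    """Encode a DNA string as a 4 x L matrix (list of 4 rows)."""
    rows = [[0] * len(seq) for _ in NUCLEOTIDES]
    for j, base in enumerate(seq.upper()):
        i = NUCLEOTIDES.find(base)
        if i >= 0:  # assumption: unknown bases (e.g. N) stay all-zero
            rows[i][j] = 1
    return rows


def stack_features(seq, extra_tracks):
    """Append per-position feature tracks (each of length L) below the one-hot rows."""
    assert all(len(track) == len(seq) for track in extra_tracks)
    return one_hot(seq) + [list(track) for track in extra_tracks]


def windows(length, size=5000):
    """Uniform, non-overlapping (start, end) windows; a partial tail is dropped."""
    return [(s, s + size) for s in range(0, length - size + 1, size)]


def stratified_split(has_zdna, test_fraction=0.2, seed=0):
    """Split window indices about 4:1, keeping the share of Z-DNA windows similar."""
    rng = random.Random(seed)
    train, test = [], []
    for flag in (True, False):
        idx = [i for i, f in enumerate(has_zdna) if f == flag]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)


matrix = stack_features("TCAGN", [[0, 1, 1, 0, 0]])  # 4 one-hot rows + 1 toy track
train_idx, test_idx = stratified_split([True] * 10 + [False] * 40)
print(len(matrix), windows(12000), len(train_idx), len(test_idx))
```

In the paper's setting the stacked matrix would have 1069 rows; the single extra track here only keeps the example small.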

Machine learning models

Baseline model

To put the performance of the deep learning models in context, we prepared a boosting classifier as a baseline. Boosting is an ensemble method that converts weak learners into a strong learner and can improve the predictions of any given learning algorithm. It consists of sequential training of simple models, where each subsequent model corrects the errors of the previous one. Boosting is a well-known method in the bioinformatics domain and generally shows good results in many classification tasks [13-15].
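The idea that each subsequent model corrects the errors of the previous one can be shown with a toy gradient-boosting loop. This is a didactic sketch only: the text does not specify the baseline's library or settings, so the one-dimensional data, the threshold "stumps" used as weak learners, the squared-error loss, and the names `fit_stump`, `boost`, and `mse` are all assumptions chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Best single-threshold regressor (a weak learner) under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value: right side would be empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, rounds=20, lr=0.5):
    """Fit stumps sequentially; each one is trained on the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + lr * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 1, 0, 0, 1, 1]  # a pattern no single stump can fit
print(mse(boost(xs, ys, rounds=1), xs, ys), mse(boost(xs, ys, rounds=30), xs, ys))
```

The training error shrinks as rounds are added, because every new stump is fitted to what the ensemble still gets wrong.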

Deep learning models

DNA has patterns in the form of one-dimensional sequence motifs, which CNN may capture very well; on the other hand, DNA is a text, so RNN may learn the context from it. Therefore, we expect the best result when we combine the two models, CNN and RNN. For a proper comparison, we also trained a standalone CNN alongside the CNN + RNN model.

CNN

We experimented with several hyperparameters for the CNN models. We considered different kernel sizes and strides because they may influence the result. The number of output kernels was set to 1, and we used a softmax layer at the end. Thus, these models produce a vector of outcomes along the length of the input, in which each position corresponds to a probability value from 0 to 1. For each position there are C boolean values, where C is the kernel size; every boolean value indicates the presence of Z-DNA at that point. The average of these C values was used as the target for the outcome cell. Since there is no padding, the number of model outcomes equals the number of averaged values. In other words, each model predicts the fraction of nucleotides in a given segment that lie within Z-DNA and assigns this number to the middle of the segment. Increasing the number of layers or the kernel size makes the model more complex but may give better results. The next set of models has more convolutional layers with ReLU activation. In this case the target variable is calculated in a slightly different way: averaging is performed according to the size of the last layer. The size and number of kernels in the first and second layers were selected from a predefined set of values.
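The averaged target for the single-layer case can be sketched as follows. This is one reading of the description above, not the authors' code: the function name `averaged_targets` is hypothetical, and the windows are assumed to follow the usual unpadded convolution layout, so that output cell j covers input positions j*stride to j*stride + C - 1.

```python
def averaged_targets(labels, kernel_size, stride=1):
    """Mean of the 0/1 labels under each unpadded convolution window."""
    n_out = (len(labels) - kernel_size) // stride + 1
    return [
        sum(labels[j * stride : j * stride + kernel_size]) / kernel_size
        for j in range(n_out)
    ]


labels = [0, 0, 1, 1, 1, 0, 0, 0]
print(averaged_targets(labels, kernel_size=4, stride=2))  # [0.5, 0.75, 0.25]
```

Because no padding is used, the number of targets matches the number of outputs of a convolution with the same kernel size and stride. For a 5000 bp window with kernel size 13 and stride 2, for example, that is (5000 - 13) // 2 + 1 = 2494 values.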

CNN+RNN

This type of hybrid model was successfully implemented in DanQ [7]. The CNN extracts important motifs, while the RNN can learn the complex regulatory grammar between the motifs. It is assumed that the motifs detected by the CNN layer also have recurrent dependencies. In theory, such a network is able to recognize a succession of motifs on which the Z-DNA configuration depends. The model architecture used for Z-DNA detection is shown in Fig. 1. There are several ways to use RNN: one-to-one, one-to-many, many-to-one, and many-to-many (Fig. 2). In this paper, we considered two approaches, many-to-many and many-to-one.

Approach many-to-one

In this case the model is structured as follows. The first part of the model consists of one or several CNN layers, and each column of the resulting output is passed separately to the RNN. For the RNN we selected a multi-layer bidirectional LSTM. The number of layers in the CNN and LSTM parts, as well as the sizes of the kernels and hidden layers, were then selected. At the end and the beginning of the sequence, the RNN layer outputs two vectors associated with the long-term LSTM memory cells; there are two LSTM context vectors because the RNN model is bidirectional. The vectors are then passed to the fully connected layer, which makes the prediction. The target variable is a boolean value indicating the presence of Z-DNA in the sequence region.

Approach many-to-many

This architecture copies the previous one except for one element. After the RNN layer, the output of the long-term memory element is ignored and the short-term memory outputs of each direction are aggregated. Each unit of the sequence then corresponds to two vectors, which are passed to the fully connected layer, and predictions are made for each part of the sequence. The target variable in this case is calculated exactly as for the CNN: each unit of the sequence is mapped to the average over a certain region of the chain.
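The two hybrid variants can be sketched in PyTorch as a single module with a switch. This is an illustrative sketch, not the authors' implementation: the class name `HybridCnnRnn` and the `many_to_many` flag are hypothetical, the default hyperparameters follow the many-to-one configuration reported in the Results, a single convolutional layer is used for both variants to keep the example short, and reading "long-term memory" as the LSTM cell state and "short-term memory" as the per-step hidden outputs is an assumption.

```python
import torch
from torch import nn


class HybridCnnRnn(nn.Module):
    """CNN -> bidirectional LSTM -> fully connected layer (illustrative sketch)."""

    def __init__(self, in_channels=1069, conv_channels=64, kernel_size=13,
                 stride=4, padding=6, hidden_size=64, num_layers=2,
                 dropout=0.7, many_to_many=False):
        super().__init__()
        self.many_to_many = many_to_many
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, conv_channels, kernel_size,
                      stride=stride, padding=padding),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(conv_channels, hidden_size, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden_size, 2)

    def forward(self, x):
        # x: (batch, features, length); the LSTM expects (batch, steps, features)
        feats = self.conv(x).transpose(1, 2)
        outputs, (h_n, c_n) = self.lstm(feats)
        if self.many_to_many:
            # Per-step hidden outputs, both directions concatenated.
            return self.fc(self.dropout(outputs))        # (batch, steps, 2)
        # Many-to-one: cell states of the last layer, forward and backward.
        context = torch.cat([c_n[-2], c_n[-1]], dim=1)   # (batch, 2 * hidden)
        return self.fc(self.dropout(context))            # (batch, 2)


x = torch.zeros(2, 8, 200)  # toy batch: 2 windows, 8 features, 200 positions
one = HybridCnnRnn(in_channels=8).eval()
many = HybridCnnRnn(in_channels=8, many_to_many=True).eval()
print(one(x).shape, many(x).shape)
```

On the toy input the many-to-one variant returns one pair of scores per window, shape (2, 2), while the many-to-many variant returns a pair per convolution step, shape (2, 50, 2).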

Results

Quantiles of the distribution of the random AUC were calculated using bootstrap sampling (Table 1). The first model (boosting) shows rather low quality, indistinguishable from that of a random choice. The best CNN model showed an AUC of 69% on the test set. Its architecture is as follows. The first layer is a convolutional layer with 36 kernels, kernel size 13, stride 2, and padding 6. The second layer is a ReLU. The third layer is a convolutional layer with 2 kernels, kernel size 13, stride 2, and padding 6. The last layer is a sigmoid. The hybrid CNN+RNN model showed higher quality than the CNN model. The best model with the many-to-one approach showed an AUC of 86.5%. The architecture of the best CNN+RNN model is as follows. The first layer is a convolutional layer with 64 kernels, kernel size 13, stride 4, and padding 6. The second layer is a ReLU. The output of the ReLU is sent to a bidirectional LSTM with hidden size 64 and 2 layers. The hidden state of the LSTM goes to a dropout layer with probability 0.7. The last fully connected layer has 2 neurons.

The best model with the many-to-many approach showed an AUC of 80.5%. The first layer is a convolutional layer with 36 kernels, kernel size 25, stride 2, and padding 12. The second layer is a ReLU. The third layer is a convolutional layer with 64 kernels, kernel size 25, stride 2, and padding 12. The fourth layer is a ReLU. The output of the ReLU is sent to a bidirectional LSTM with hidden size 64 and 2 layers. The hidden state of the LSTM goes to a dropout layer with probability 0.7. The last fully connected layer has 2 neurons.
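One way to obtain quantiles of a random AUC by bootstrap is sketched below. The text does not give the exact procedure, so this is an assumption-laden illustration: labels are resampled with replacement, scores are drawn uniformly at random, and the function names `auc` and `random_auc_quantiles`, the quantile levels, and the toy class balance are all hypothetical.

```python
import random


def auc(labels, scores):
    """ROC AUC: probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count 0.5
    return wins / (len(pos) * len(neg))


def random_auc_quantiles(labels, n_boot=1000, qs=(0.025, 0.5, 0.975), seed=0):
    """Quantiles of the AUC of random scores over bootstrap resamples of the labels."""
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        sample = [labels[rng.randrange(n)] for _ in range(n)]
        if 0 < sum(sample) < n:  # AUC needs both classes present
            aucs.append(auc(sample, [rng.random() for _ in range(n)]))
    aucs.sort()
    return [aucs[min(int(q * n_boot), n_boot - 1)] for q in qs]


toy_labels = [1] * 30 + [0] * 70  # hypothetical class balance
low, median, high = random_auc_quantiles(toy_labels, n_boot=200)
print(low, median, high)
```

A model whose AUC falls inside the resulting interval around 0.5 cannot be distinguished from random guessing on a sample of that size.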
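As a quick check on the layer settings listed above, the number of output positions per 5000 bp window can be computed with the standard one-dimensional convolution length formula, floor((L + 2p - k) / s) + 1. The sketch assumes no dilation and the 5000 bp input stated in the Methods; the helper names are hypothetical.

```python
def conv_out_len(length, kernel, stride, padding):
    """Output length of a 1D convolution without dilation."""
    return (length + 2 * padding - kernel) // stride + 1


def stack_out_len(length, layers):
    """Apply conv_out_len layer by layer; layers hold (kernel, stride, padding)."""
    for kernel, stride, padding in layers:
        length = conv_out_len(length, kernel, stride, padding)
    return length


configs = {
    "CNN": [(13, 2, 6), (13, 2, 6)],
    "CNN+RNN many-to-one": [(13, 4, 6)],
    "CNN+RNN many-to-many": [(25, 2, 12), (25, 2, 12)],
}
for name, layers in configs.items():
    print(name, stack_out_len(5000, layers))
```

Under these assumptions all three reported configurations reduce a 5000 bp window to 1250 positions, so each output position summarizes roughly 4 nucleotides.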

Conclusions and Discussion

The following conclusions can be drawn from the obtained results. Although the CNN model shows higher performance than the baseline, it does not handle the sequential nature of the DNA sequence. The baseline and CNN models perform much worse than a model that contains an RNN layer. The maximum quality achieved on this dataset with this set of architectures is an AUC of about 86%, which indicates that the task can be solved using the available data.

Here we presented the results of a deep learning approach to Z-DNA prediction, in particular a hybrid model of two well-known deep learning architectures, CNN and RNN. This architecture outperforms both models based only on CNN and classical machine learning models such as gradient boosting. As we expected, CNN + RNN shows better results than CNN because RNN may capture the sequential pattern using its context. We assume that our approach may be applied to many other bioinformatics tasks that require mapping spatial data to sequential output.

One of the advantages of our approach is scalability: the system can be upgraded when more epigenetic and regulatory data become available. Thus, the same type of model can be applied to the recognition of quadruplexes or triplexes, as well as to patterns of association between DNA secondary structures and the epigenetic code. We expect that the inclusion of omics data will improve the prediction quality of the model. However, a large feature space has a drawback: it increases the model training time. It would be beneficial first to find a minimal feature set that achieves the desired model quality and then to train the model on the reduced feature space. This would also help to find scientifically important associations between the studied functional elements and epigenetic and/or regulatory elements.

Deep neural networks are capable of effectively processing aggregated information from different levels of genome organization. At present, when next-generation sequencing experiments are still too expensive, machine learning models for annotating genomes with functional genomic elements are very important. For some species, next-generation sequencing experiments on the epigenomic and regulatory code are not available at all. Finding de novo or imputed novel functional elements with computational artificial intelligence systems would help researchers understand the principles and mechanisms of genome functioning.

Modeling and Analysis of Complex Systems and Processes - MACSPro'2020, October 22-24, 2020, Venice, Italy & Moscow, Russia
EMAIL: nazar.s.beknazarov@gmail.com (A. 1); mpoptsova@hse.ru (A. 3)
ORCID: 0000-0002-7198-8234 (A. 3)

Figure 1: Architecture of a hybrid model, CNN + RNN, for Z-DNA prediction. DNA sequence data transformed with one-hot encoding was concatenated with sparse vectors of epigenomic data.

Figure 2: Schematic representation of approaches for classification using the RNN architecture.

Table 1: Experiment results

Model      AUC     Accuracy
Boosting   0.532   0.691
CNN        0.69    0.55
CNN+RNN    0.865   0.75

References

Meireles MRG, Almeida PEM, Simoes MG. A comprehensive review for industrial applicability of artificial neural networks. IEEE Transactions on Industrial Electronics 50 (2003).
LeCun Y, Haffner P, Bottou L, Bengio Y. Object recognition with gradient-based learning. In: Forsyth DA, Mundy JL, Di Gesú V, Cipolla R (eds.) Shape, contour and grouping in computer vision. Springer, Berlin, Heidelberg (1999).
Hu G, Yang Y, Yi D, Kittler J, Christmas W, Li SZ, Hospedales T. When face recognition meets with deep learning: an evaluation of convolutional neural networks for face recognition. Proceedings of the IEEE International Conference on Computer Vision Workshops (2015).
Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press (2016).
Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation 9(8) (1997).
Fan Y, Lu X, Li D, Liu Y. Video-based emotion recognition using CNN-RNN and C3D hybrid networks. Proceedings of the 18th ACM International Conference on Multimodal Interaction (2016).
Wang J, Yang Y, Mao J, Huang Z, Huang C, Xu W. CNN-RNN: A unified framework for multi-label image classification. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016).
Selvin S, Vinayakumar R, Gopalakrishnan EA, Menon VK, Soman KP. Stock price prediction using LSTM, RNN and CNN-sliding window model. International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE (2017).
Min S, Lee B, Yoon S. Deep learning in bioinformatics. Brief Bioinform 18 (2017).
Quang D, Xie X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res 44, e107 (2016).
Lim A, Lim S, Kim S. Enhancer prediction with histone modification marks using a hybrid neural network model. Methods 166 (2019).
Kouzine F, Wojtowicz D, Baranello L, Yamane A, Nelson S, Resch W, Kieffer-Kwon KR, Benham CJ, Casellas R, Przytycka TM, Levens D. Permanganate/S1 Nuclease Footprinting Reveals Non-B DNA Structures with Regulatory Potential across a Mammalian Genome. Cell Syst 4 (2017).
Shin SI, Ham S, Park J, Seo SH, Lim CH, Jeon H, Huh J, Roh TY. Z-DNA-forming sites identified by ChIP-Seq are associated with actively transcribed regions in the human genome. DNA Res (2016).
Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, Ziller MJ, Amin V, Whitaker JW, Schultz MD, Ward LD, Sarkar A, Quon G, Sandstrom RS, Eaton ML, Wu YC, Pfenning AR, Wang X, Claussnitzer M, Liu Y, Coarfa C, Harris RA, Shoresh N, Epstein CB, Gjoneska E, Leung D, Xie W, Hawkins RD, Lister R, Hong C, Gascard P, Mungall AJ, Moore R, Chuah E, Tam A, Canfield TK, Hansen RS, Kaul R, Sabo PJ, Bansal MS, Carles A, Dixon JR, Farh KH, Feizi S, Karlic R, Kim AR, Kulkarni A, Li D, Lowdon R, Elliott G, Mercer TR, Neph SJ, Onuchic V, Polak P, Rajagopal N, Ray P, Sallari RC, Siebenthall KT, Sinnott-Armstrong NA, Stevens M, Thurman RE, Wu J, Zhang B, Zhou X, Beaudet AE, Boyer LA, De Jager PL, Farnham PJ, Fisher SJ, Haussler D, Jones SJ, Li W, Marra MA, McManus MT, Sunyaev S, Thomson JA, Tlsty TD, Tsai LH, Wang W, Waterland RA, Zhang MQ, Chadwick LH, Bernstein BE, Costello JF, Ecker JR, Hirst M, Meissner A, Milosavljevic A, Ren B, Stamatoyannopoulos JA, Wang T, Kellis M. Integrative analysis of 111 reference human epigenomes. Nature 518 (2015).
Davis CA, Hitz BC, Sloan CA, Chan ET, Davidson JM, Gabdank I, Hilton JA, Jain K, Baymuradov UK, Narayanan AK, Onate KC, Graham K, Miyasato SR, Dreszer TR, Strattan JS, Jolanki O, Tanaka FY, Cherry JM. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res 46 (2018).
Hothorn T, Bühlmann P. Model-based boosting in high dimensions. Bioinformatics 22 (2006).
Dettling M, Bühlmann P. Boosting for tumor classification with gene expression data. Bioinformatics 19 (2003).
Eickholt J, Cheng J. Predicting protein residue-residue contacts using deep networks and boosting. Bioinformatics 28 (2012).