Analysis of NGS Data on the Transcriptional Regulation1 Semyon K. Kolmykov1,2,3, Ivan S. Evshin1,2, Fedor A. Kolpakov1,2 1 Institute of Computational Technologies SB RAS, Novosibirsk, Russian Federation 2 BIOSOFT.RU, LLC, Novosibirsk, Russian Federation 3 Institute of Cytology and Genetics SB RAS, Novosibirsk, Russian Federation Abstract. The GTRD database (http://gtrd.biouml.org) contains over 30,000 uniformly processed NGS experiments on transcriptional regulation (ChIP-seq, ChIP-exo, DNase-seq, ATAC-seq, MNase-seq and FAIRE-seq). To process these types of data, pipelines have been developed for the eGrid distributed computing management system and the BioUML platform. Keywords: GTRD; BioUML; NGS; transcriptional regulation; ChIP-seq; ChIP-exo; DNase- seq; ATAC-seq; FAIRE-seq. 1 Introduction The GTRD database (Gene Transcription Factors Database, http://gtrd.biouml.org) [1] contains information on the main components of transcriptional regulation: transcription factor binding sites and open chromatin regions. These data have been experimentally identified in various cell types and tissues using high-throughput methods: ChIP-seq and DNase- seq, respectively. Nowadays, GTRD database is the largest database in terms of size of uniformly annotated and processed ChIP-seq experiments on TFBSs identification (http://wiki.biouml.org/index.php/GTRD_comparison). The Statistics section on the GTRD start page provides detailed information on the number of analyzed data for each type of NGS experiment. For the further development of the GTRD database, pipelines of processing NGS data of various types of experiments were built: ● ChIP-seq - mapping of transcription factor binding sites (TFBSs) and various types of modifications of histones (HM) on the reference genome; ● ChIP-exo - TFBSs mapping; ● DNase-seq, ATAC-seq, MNase-seq and FAIRE-seq - mapping of open chromatin regions on the reference genome, as well as localization of individual nucleosomes. To process these types of NGS experiments, corresponding pipelines were developed for: ● eGrid distributed computing management systems. This system was designed to control the distributed processing of NGS data included in the GTRD on a cluster of 12 servers (each - 2 Intel® Xeon® CPU X5650 processors, 32-48GB RAM). The pipelines described below were implemented in the form of programs written in Java language with a set of software packages necessary for NGS data processing. ● The BioUML platform - web-platform for the analysis of biomedical data (https://ict.biouml.org). BioUML includes a wide range of capabilities, including access to databases with experimental data, tools for a formalized description of the structure and functioning of biological systems, as well as tools for their visualization, modeling, selection of parameters and analysis [2]. In this case, the developed scenarios are implemented as pipelines (workflows) and, unlike eGrid, are available to all users of this platform. Each user has the ability to change the parameters of the analyzes used in their own copy of the workflow. Also, due to the graphical representation of workflows and the available functionality of the workflow editor, users have the ability to both change the structure and build their own workflows for data processing. 2 Materials and Methods In developing these scenarios, we followed the recommendations of the ENCODE project [3]. In the developed pipelines, the first steps of data processing are similar. At first, an initial quality analysis of raw data is performed using the FastQC software package (http://www.bioinformatics.babraham.ac.uk/projects/fastqc) [4]. If necessary, adapter sequences are deleted using Trimmomatic package [5]. Alignment of the raw reads to the reference genome is performed using bowtie2 (bowtie2 --seed 0) [6]. Subsequent filtering of the obtained alignment (MAPQ ≥ 10) and sorting Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0) 19 by coordinate value are performed using the samtools software package: “samtools view -bq 10” and “samtools sort”, respectively. If data is represented by a library of double-end readings, then PCR duplicates are removed by Picard MarkDuplicates (https://broadinstitute.github.io/picard) [7]. The next step of processing is searching for genome regions enriched with aligned reads (peak calling). To this date, more than 30 peak calling algorithms have been published [8]. Since there is a difference in the distribution of aligned reads on the reference genome for different types of NGS data, the efficiency of peak calling depends on both the selected peakcaller and a set of its initial parameters. Therefore, this stage is specific for each type of NGS experiments under consideration. Figure 1. General pipeline of NGS data processing. The ChIP-seq method is used for the whole genome searching for transcription factor binding sites (TFBSs) and various types of histone modifications (HM). When searching for TFBSs in ChIP-seq experiments represented by a library of double-end reads, only the first mate of aligned read pairs (mate-1; “samtools view -F128”) is selected for subsequent analysis. TFBSs are identified using the following set of peakcallers: MACS2 [9], GEM [10], SISSRS [11] and PICS [12]. It is worth noting that when using MACS2, the Phantompeakqualtools package [13] is used for modeling the shift size and calculation of the fragment length. In the case of processing ChIP-seq experiments on various types of HMs identifying, a set of initial parameters of the software used depends on the type of HMs (see Table 1). For wide marks, MACS2 starts up with the following parameters: “macs2 callpeak --broad --broad-cutoff 0.1”, and the default parameters are used to search for narrow marks. MOSAiCS method for narrow marks and MOSAiCS-HMM method for broad peaks are also used for HMs identification [14]. Table 1. Types of histone modifications (https://www.encodeproject.org/chip-seq/histone/#histone) To map nucleosomes on a reference genome based on processing data from MNase-seq experiments, the DANPOS2 software package (“danpos.py dpos”) is used [15]. A group of methods, consisting of DNase-seq, ATAC-seq, and FAIRE-seq, is used for mapping of open chromatin regions on a reference genome, thereby identifying potential regulatory regions. MACS2 and Hotspot2 (https://github.com/Altius/hotspot2) are used for identification of regions with high sensitivity to cleavage by endonuclease DNase I [16]. Due to the differences in preparation of the libraries, for single-cut DNase-seq experiments, MACS2 was used with the initial parameters: “–nomodel –shift -100 –extsize 200”; for other cases, the default parameters were used. To map open chromatin regions based on ATAC-seq or FAIRE-seq experiments, MACS2 is also used, but in the vast majority of cases (when the data is represented by a pair-reading library), this method starts with the parameters “--keep-dup all -f BAMPE”, and in the rest with “--keep-dup all –shift 100 –extsize 200”. Also, Genrich (https://github.com/jsh58/Genrich) [17] is used to process data from ATAC-seq experiments. The final step in the 20 processing of DNase-seq experiments is the identification of the DNase I footprints (putative areas of protein-DNA interaction; small regions of DNase-seq profiles with reduced sensitivity to cleavage by DNase I and location in regions with high activity of this enzyme). This step is performed using Wellington [18]. Analysis of the quality of NGS data begins with an assessment of the quality of the raw reads using the FastQC program. At the next stage, the alignment quality of the initial reads is estimated. In addition to calculating basic alignment statistics (samtools flagstat), the complexity of the library (NRF, PBC1 and PBC2) and the signal-to-noise ratio (based on cross-correlation (NSC and RSC) estimated using Phantompeakqualtools package [13]) are analysed. Upon completion of the peak calling stage, the percentage of reads that fall into the obtained TFBSs (FRiP) is calculated. In the case of processing ChIP-seq experiments, the data processing process was completed by evaluating FPCM and FNCM values [19]. In the future, it is planned to expand the set of methods for quality analysis of NGS data. 3 Results As a part of the GTRD database development project, the scenarios for NGS data processing integrated into the eGrid distributed computing management system were used in preparation of the current release of the database. The information on the current release statistics is available on GTRD starting page (http://gtrd.biouml.org). Furthermore, the software packages used for processing NGS data were installed on the BioUML platform using the Galaxy interface. To combine this set of programs into pipelines for processing various types of NGS experiments, we used workflow edition tools available on BioUML platform. Thus, 7 scenarios for processing NGS experiment data were built: “ChIP-seq TF pipeline”, “ChIP-seq HM pipeline”, “ChIP-exo pipeline”, “DNase-seq pipeline”, “ATAC-seq pipeline”, “MNase-seq pipeline” and “FAIRE-seq pipeline”. It is worth noting that it is possible for users to change both the initial parameters of the software packages used in workflow and the structure of the pipelines themselves. Figure 2. Diagram representation of “ATAC-seq pipeline” on BioUML platform. Green blocks - initial parameters of workflow; blue blocks - analyses in use; orange blocks - files generated by analyses in use. 4 Conclusion Thus, we compiled 7 scenarios for processing different types of NGS data: ChIP-seq TF, ChIP-seq HM, ChIP-exo, DNase-seq, ATAC-seq, MNase-seq and FAIRE-seq. The developed data processing scenarios were used in preparation of the current release of database of regulatory elements (GTRD). Also, these scenarios were implemented as workflows on BioUML platform. Acknowledgements. This work was supported by the Russian Science Foundation, grant number: 19-14-00295. References [1] Yevshin I. et al. GTRD: a database on gene transcription regulation—2019 update //Nucleic acids research. – 2018. – Т. 47. – №. D1. – С. D100-D105. 21 [2] Kolpakov F. et al. BioUML: an integrated environment for systems biology and collaborative analysis of biomedical data //Nucleic acids research. – 2019. – Т. 47. – №. W1. – С. W225-W233. [3] ENCODE Project Consortium et al. An integrated encyclopedia of DNA elements in the human genome //Nature. – 2012. – Т. 489. – №. 7414. – С. 57. [4] Andrews S. FastQC: a quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc [5] Bolger A. M., Lohse M., Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data // Bioinformatics. – 2014. – Т. 30. – №. 15. – С. 2114-2120. [6] Langmead B., Salzberg S. L. Fast gapped-read alignment with Bowtie 2 // Nature methods. – 2012. – Т. 9. – №. 4. – С. 357. [7] Broad Institute. Picard Toolkit: Java command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. https://github.com/broadinstitute/picard. [8] Thomas R. et al. Features that define the best ChIP-seq peak calling algorithms //Briefings in bioinformatics. – 2017. – Т. 18. – №. 3. – С. 441-450 [9] Zhang Y. et al. Model-based analysis of ChIP-Seq (MACS) // Genome biology. – 2008. – Т. 9. – №. 9. – С. R137. [10] Guo Y., Mahony S., Gifford D. K. High resolution genome wide binding event finding and motif discovery reveals transcription factor spatial binding constraints //PLoS computational biology. – 2012. – Т. 8. – №. 8. – С. e1002638. [11] Narlikar L., Jothi R. ChIP-Seq data analysis: identification of Protein–DNA binding sites with SISSRs peak-finder //Next Generation Microarray Bioinformatics. – Humana Press, 2012. – С. 305-322. [12] Zhang X. et al. PICS: probabilistic inference for ChIP‐seq //Biometrics. – 2011. – Т. 67. – №. 1. – С. 151-163. [13] Landt S. G. et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia // Genome research. – 2012. – Т. 22. – №. 9. – С. 1813-1831. [14] Chung D., Zhang Q., Keles S. MOSAiCS-HMM: A model-based approach for detecting regions of histone modifications from ChIP-seq data //Statistical Analysis of Next Generation Sequencing Data. – Springer, Cham, 2014. – С. 277-295. [15] Chen K. et al. DANPOS: dynamic analysis of nucleosome position and occupancy by sequencing //Genome research. – 2013. – Т. 23. – №. 2. – С. 341-351. [16] Rynes E. et al. Hotspot2: Program for identifying genomic regions with statistically significant "hotspots," or enrichments, of cleavage activity in DNase-seq experiments. https://github.com/Altius/hotspot2 [17] Genrich: a peak-caller for genomic enrichment assays. https://github.com/jsh58/Genrich [18] Piper J. et al. Wellington: a novel method for the accurate identification of digital genomic footprints from DNase- seq data // Nucleic acids research. – 2013. – Т. 41. – №. 21. – С. e201-e201. [19] Kolmykov S. K. et al. Population size estimation for quality control of ChIP-Seq datasets //PloS one. – 2019. – Т. 14. – №. 8. 22