Data intensive analysis approaches in genomics and proteomics: ELIXIR initiatives

(Extended abstract of an invited talk)

© Alexander A. Kanapin
Department of Oncology, University of Oxford, Oxford, UK
alexander.kanapin@oncology.ox.ac.uk

Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL’2015), Obninsk, Russia, October 13-16, 2015

Abstract

Breakthroughs in genome sequencing technologies have resulted in unprecedented growth of data volumes in genomics and proteomics. The new paradigm of precision medicine implies wide practical use of these types of data. ELIXIR, a pan-European bioinformatics consortium, addresses the challenges arising from the production, storage and analysis of massive data collections in genomics and proteomics and proposes several pilot programs that aim to develop standards and algorithms for data analysis. The interdisciplinary initiatives of the consortium, such as “BILS-ProteomeXchange integration using EUDAT resources” and “Interoperability of protein resources for drug discovery: Improving Links Between the Human Protein Atlas (HPA) and EMBL-EBI Protein Resources”, are of great interest, and their successful implementation requires collaboration between researchers and IT engineers. The article also describes the general principles of the consortium's organization and potential ways of participating in its collaboration projects and programs.

1 Introduction

Biology was traditionally a science based on qualitative observations and, in contrast to physics, it produced relatively small amounts of quantitative data. The situation changed dramatically in the last quarter of the twentieth century. Rapid progress in technologies for analysing living systems (cells and organisms) at the molecular level resulted in a burst of data, primarily describing features of biological molecules such as nucleic acids and proteins. The concurrent appearance of personal computers and global networks facilitated the storage and processing of such information on both small and large scales.

As a result, a new discipline emerged in 1989, when the term “bioinformatics” appeared in the title of a scientific paper for the first time [7]. The first databases of primary structures of nucleic acids [3] and proteins [1] were published in 1985 and 1991 respectively. From the very beginning and up to the present, the majority of the data deposited in biological databanks has consisted of sequences of biopolymers, namely nucleic acids and proteins.

The successful sequencing of a draft human genome in 2001 [6] was the next big step in the development of bioinformatics and gave tremendous momentum to the creation of new computational engineering solutions and the design of novel algorithms for genomic and proteomic data analysis [10].

Progress in biological data acquisition technologies remains one of the major driving forces in bioinformatics. Next Generation Sequencing (NGS) techniques make it possible to obtain complete genome sequences cheaply and quickly. This development may shift the paradigm of traditional medicine towards personal and precise approaches to each individual patient [11]. At the same time, it creates new challenges in data intensive analytics, both for data storage and manipulation technologies and for algorithmic approaches.

A practical solution of such tasks is only possible within a framework of international consortia and collaboration. ELIXIR, a pan-European consortium in bioinformatics, opens new opportunities for establishing collaboration through its pilot initiatives.

2 Bioinformatics resources

Data management in bioinformatics has gradually evolved with the increasing volumes of biological data. Historically, the protein and nucleic acid databanks delivered information on CDs and similar media. Later, when network bandwidth allowed large amounts of data to be downloaded, the databases became available as downloadable flat files. At present, bioinformatics resources may be classified into the following rough categories:

• Data repositories. Public or commercial data banks containing primary sequences and structures of biopolymers. The repositories also contain tools to analyse data provided by the user in the context of the resource. Examples: UniProt, GenBank, RCSB PDB.
• Analytical toolboxes. Complex portals providing exclusive algorithms for user data analysis.
• Bioinformatics cloud resources.

3 ELIXIR: pan-European collaboration in bioinformatics

The ELIXIR consortium was founded in 2006 by the European Molecular Biology Laboratory (EMBL). The consortium officially started as a fully functional body in December 2013, when the consortium agreement was signed by the first member states. At present it includes 12 full members and 6 observers.

The major goal of ELIXIR is the coordination of efforts in quality control and archiving of life sciences data on a pan-European scale. The complexity and heterogeneity of the data call for the creation of an infrastructure and a system of standards, as well as the development of proper training programs. ELIXIR will act as a sustainable repository for life science data that has been funded by the public [2].

The consortium is organized as a network of interactions between a central hub (Hinxton, UK) and national nodes in each of the member states. Participation in research pilot initiatives is open to all scientific organizations of the member states.

Currently, ELIXIR is unfolding its activities through a series of pilot programs and initiatives. The scientific program of the consortium proposes several research and development avenues along the main directions of future development of data intensive analytics in biological sciences [5].

4 Data intensive analytical programs in proteomics and genomics

4.1 Integrative genomics initiatives in ELIXIR

Comprehensive resources covering various data modalities in genomics are an essential prerequisite for modern research in biological sciences and translational medicine. EMBL-EBI has pioneered this effort since the creation of one of the first nucleotide sequence databases, EMBL-base. Now, as a part of ELIXIR services, it provides a diverse spectrum of genomics data resources, the most outstanding of which are:

• ENA (European Nucleotide Archive), centred around nucleotide sequencing. The resource contains raw sequencing data, sequence assemblies and functional annotation of the data.
• EnsEMBL, a unique genome annotation resource containing high-quality integrated annotation of vertebrate genomes. The resource includes BioMart, a data mining interface for data retrieval.
• European Variation Archive, a recent development of a novel approach to genomic data. The database contains all types of genetic variation data.
• Expression Atlas, an RNA-related portal collecting information about gene expression patterns in different species and under various biological conditions.

High-quality manual curation and verification of the information in the databases ensures the reliability of the available data. Internal connectivity and integration between the different resources in the Institute allow a high level of data integrity and consistency.

4.2 Proteomics in ELIXIR

Proteomics research makes up a significant part of the consortium's scientific programme. Several protein and protein expression resources have been established in Europe, containing valuable information for biomedical research. Seamless navigation between these resources is an important prerequisite for scientists who need to make informed decisions about research into new drug targets and who are exploring links between different proteins in healthy and diseased tissues. The Swedish national node of ELIXIR plays an important role in this action, working with EMBL-EBI. The consolidated efforts make the Human Protein Atlas interoperable with proteomic resources such as PRIDE, InterPro, and the Gene Expression Atlas.

4.3 BILS - ProteomeXchange

The arrival of tremendous volumes of biological data calls for distributed data storage and replication and for a reliable and scalable data access interface. One of the ELIXIR pilot initiatives aims to integrate the raw data repositories for mass spectrometry proteomics data run by the Bioinformatics Infrastructure for Life Sciences (BILS, Sweden) and the ProteomeXchange consortium via the PRIDE database, hosted at EMBL-EBI, UK. The key component of the infrastructure is provided by the European infrastructure EUDAT (http://www.eudat.eu/).

The ProteomeXchange consortium facilitates the submission of proteomics data and the standardization of dissemination practices for proteomics data resources. The main goal of the consortium is to develop a framework that allows standard data submission and dissemination pipelines between the main proteomic repositories, such as PeptideAtlas, PRIDE and MassIVE. The consortium encompassed 1963 proteomics datasets as of May 2015. PRIDE, one of the key participants, stores MS-based proteomics data, such as protein expression data, post-translational modifications, raw MS data and technical metadata.

BILS is a distributed national research infrastructure supported by the Swedish Research Council; its bioinformatics network includes 6 nodes in major Swedish universities. Proteios, a multi-user platform for the analysis and management of proteomics data, was developed as an essential part of the integrative initiatives of BILS.

EUDAT is a pan-European project aiming to build and operate a global collaborative data infrastructure for the preservation and exchange of scientific data in various disciplines. Essential components of its software ecosystem, such as B2SAFE and iRODS, ensure robust, safe and highly available data access. B2SAFE software is a key component of the ProteomeXchange data infrastructure.

The initiative could serve as an example of the engagement of various types of data storage services in ELIXIR and demonstrate the potential of collaboration among research infrastructures and e-infrastructures to better manage the data deluge.

4.4 Protein resources in drug discovery

Important aspects of many genetic diseases are reflected in the potentially different roles of proteins and pathways in diverse cell lineages. Interoperability between databases that provide tissue-specificity information and describe the expression of genes and proteins in multiple tissues, at different stages of development and in different disease conditions, is becoming critically important for modern approaches in drug discovery. The heterogeneity of data representation in these expression resources poses a challenge, as they often complement each other while different providers follow different rules to annotate and provide the information. The major goal of the ELIXIR pilot is to define and implement standards and tools to facilitate access to and integration of the data for the scientific community. The proteomics and expression resources in the framework include:

• The Human Protein Atlas (HPA) [4], a database of protein expression profiles based on immunohistochemistry.
• The PRoteomics IDEntifications database (PRIDE) [9], a public data repository for protein and peptide identifications.
• The Gene Expression Atlas (GXA) [8], an enriched database of gene expression patterns.

The project proposes the following integration strategies. First, summaries of information from different databases will be created, based on a single entry point and a common format. This approach was successfully introduced earlier by the EMBL-EBI search portal and consists of an amalgamation of service layers on top of a database providing summary data in a standard manner, while the original resources do not change their data or schemas. The non-intrusive approach ensures the independence of the original sources and provides on-demand integration. The second approach was adopted by the BioSapiens consortium and defines a common terminology and format to describe the minimum information for specific data entries. It provides a common language and standard data format to integrate and compare protein annotations for 39 databases. The strategy requires agreement on controlled vocabularies and on changes that might affect data content and the annotation process, and is therefore a more challenging task for the data providers.

The Distributed Annotation System (DAS) was used as a communication fabric to disseminate protein expression summary data and protein sequence annotations, since GXA and PRIDE already use DAS to provide expression data. In collaboration with HPA, a new DAS service was created to provide expression summaries. Collaboration with other resources, such as UniProt, PDB, Pfam, InterPro, PRIDE and IntAct, continues, aiming to create a BioJS component that standardizes the visualization of protein features and will be used to represent related expression data such as antibody binding and protein identifications.

One of the major challenges for the integration of expression information among the listed sources is metadata annotation. The implementation of metadata harmonisation is planned as a next step, based on the Experimental Factor Ontology (EFO) as a reference system. HPA also proposes an XML solution, which is more standardized and flexible than DAS and might be better suited as a means of data exchange.

References

[1] Bairoch A, Boeckmann B. The SWISS-PROT protein sequence data bank. Nucl. Ac. Res., v. 19, p. 2247-2249, 1991.
[2] Blomberg N. ELIXIR: Data for life. 2014, https://www.elixir-europe.org/system/files/ELIXIR_2014_brochure_full.pdf
[3] Burks C, et al. The GenBank nucleic acid sequence database. Comput. Appl. Biosci., v. 4, p. 225-233, 1985.
[4] Colwill K; Renewable Protein Binder Working Group, Gräslund S. A roadmap to generate renewable protein binders to the human proteome. Nat Methods., v. 15, p. 551-558, 2011.
[5] ELIXIR consortium. Scientific programme 2014-2018. Executive summary. 2015, https://www.elixir-europe.org/system/files/ELIXIR-Executive-Summary-2015_Digital.pdf
[6] Lander E, et al. Initial sequencing and analysis of the human genome. Nature, v. 409, p. 860-921, 2001.
[7] Masys D. New directions in bioinformatics. J. of Res. Nat. Inst. Stand. and Techn., v. 94, p. 59-63, 1989.
[8] Petryszak R, et al. Expression Atlas update--a database of gene and transcript expression from microarray- and sequencing-based functional genomics experiments. Nucleic Acids Res., v. 42, p. 926-932, 2014.
[9] Reisinger F, et al. Introducing the PRIDE Archive RESTful web services. Nucleic Acids Res., pii: gkv382, 2015.
[10] Thornton J. The future of bioinformatics. Trends in Biotechn., v. 17, p. 30-31, 1998.
[11] Topol E. Individualized medicine from prewomb to tomb. Cell, v. 157, p. 241-253, 2014.