Data intensive analysis approaches in genomics and
                   proteomics: ELIXIR initiatives
                                      (Extended abstract of an invited talk)

                                        © Alexander A. Kanapin
                        Department of Oncology, University of Oxford, Oxford, UK

                                     alexander.kanapin@oncology.ox.ac.uk


Proceedings of the XVII International Conference
«Data Analytics and Management in Data
Intensive Domains» (DAMDID/RCDL’2015),
Obninsk, Russia, October 13 - 16, 2015

Abstract

    Breakthroughs in genome sequencing technologies have resulted in
unprecedented growth of data volumes in genomics and proteomics. The new
paradigm of precision medicine implies wide practical use of these types of
data. ELIXIR, a pan-European bioinformatics consortium, addresses the
challenges arising from the production, storage and analysis of massive data
collections in genomics and proteomics, and proposes several pilot programs
that aim to develop standards and algorithms for data analysis. The
interdisciplinary initiatives of the consortium, such as “BILS-ProteomeXchange
integration using EUDAT resources” and “Interoperability of protein resources
for drug discovery: Improving Links Between the Human Protein Atlas (HPA) and
EMBL-EBI Protein Resources”, are of great interest, and their successful
implementation requires collaboration between researchers and IT engineers.
The article also describes the general principles of the consortium's
organization and potential ways of participating in its collaborative projects
and programs.

1 Introduction

    Biology was traditionally a science based on qualitative observations and,
in contrast to physics, produced relatively small amounts of quantitative
data. The situation changed dramatically in the last quarter of the twentieth
century. Rapid progress in technologies for analysing living systems (cells
and organisms) at the molecular level resulted in a burst of data, primarily
describing features of biological molecules such as nucleic acids and
proteins. The concurrent appearance of personal computers and global networks
facilitated the storage and processing of such information on both small and
large scales.
    As a result, a new discipline emerged in 1989, when the term
“bioinformatics” appeared in the title of a scientific paper for the first
time [7]. The first databases of primary structures of nucleic acids [3] and
proteins [1] were published in 1985 and 1991, respectively. From the very
beginning up to the present, the majority of the data deposited in biological
databanks has consisted of sequences of biopolymers, namely nucleic acids and
proteins.
    The successful sequencing of a draft human genome in 2001 [6] was the next
big step in the development of bioinformatics and gave tremendous momentum to
the creation of new computational engineering solutions and the design of
novel algorithms for genomic and proteomic data analysis [10].
    Progress in biological data acquisition technologies remains one of the
major driving forces in bioinformatics. Next Generation Sequencing (NGS)
techniques make it possible to obtain complete genome sequences cheaply and
quickly. This development may shift the paradigm of traditional medicine
towards personalised and precise approaches to each individual patient [11].
At the same time, however, it creates new challenges in data intensive
analytics, both for data storage and manipulation technologies and for
algorithmic approaches.
    A practical solution to such tasks is possible only within a framework of
international consortia and collaboration. ELIXIR, a pan-European consortium
in bioinformatics, opens new opportunities for establishing collaboration
through its pilot initiatives.

2 Bioinformatics resources

    Data management in bioinformatics has evolved gradually with the
increasing volumes of biological data. Historically, the protein and nucleic
acid databanks delivered information on CDs and similar media. Later, when
network bandwidth allowed large amounts of data to be downloaded, the
databases became available as downloadable flat files. At present,
bioinformatics resources may be roughly classified into the following
categories:
• Data repositories. Public or commercial data banks containing primary
  sequences and structures of biopolymers. The repositories also provide tools
  for analysing user-supplied data in the context of the resource. Examples:
  UniProt, GenBank, RCSB PDB.
• Analytical toolboxes. Complex portals providing exclusive algorithms for
  user data
  analysis.
• Bioinformatics cloud resources.

3 ELIXIR: pan-European collaboration in bioinformatics

    The ELIXIR consortium was founded in 2006 by the European Molecular
Biology Laboratory (EMBL). The consortium officially started as a fully
functional body in December 2013, when the consortium agreement was signed by
the first member states. At present it includes 12 full members and 6
observers.
    The major goal of ELIXIR is to coordinate efforts in quality control and
archiving of life sciences data on a pan-European scale. The complexity and
heterogeneity of the data call for the creation of an infrastructure and a
system of standards, as well as the development of proper training programs.
ELIXIR will act as a sustainable repository for life science data that has
been funded by the public [2].
    The consortium is organized as a network of interactions between a central
hub (Hinxton, UK) and national nodes in each of the member states.
Participation in the research pilot initiatives is open to all scientific
organizations of the member states.
    Currently, ELIXIR is unfolding its activities through a series of pilot
programs and initiatives. The scientific program of the consortium proposes
several research and development avenues along the main directions of future
development of data intensive analytics in biological sciences [5].

4 Data intensive analytical programs in proteomics and genomics

4.1 Integrative genomics initiatives in ELIXIR

    Comprehensive resources covering various data modalities in genomics are
an essential prerequisite for modern research in biological sciences and
translational medicine. EMBL-EBI has pioneered this effort since the creation
of one of the first nucleotide sequence databases, EMBL-base. Now, as part of
ELIXIR services, it provides a diverse spectrum of genomics data resources,
the most notable of which are:
• ENA – European Nucleotide Archive, centred around nucleotide sequencing.
  The resource contains raw sequencing data, sequence assemblies and
  functional annotation of the data.
• EnsEMBL – a unique genome annotation resource containing high-quality
  integrated annotation of vertebrate genomes. The resource includes BioMart,
  a data mining interface for data retrieval.
• European Variation Archive – a recent development of a novel approach to
  genomic data; the database contains all types of genetic variation data.
• Expression Atlas – an RNA-related portal collecting information about gene
  expression patterns in different species and under various biological
  conditions.
    High quality manual curation and verification of the information in the
databases ensures the reliability of the available data. Internal connectivity
and integration between the different resources in the Institute allow a high
level of data integrity and consistency.

4.2 Proteomics in ELIXIR

    Proteomics research forms a significant part of the consortium's
scientific programme. Several protein and protein expression resources have
been established in Europe, containing valuable information for biomedical
research. Seamless navigation between these resources is an important
prerequisite for scientists who need to make informed decisions in their
research into new drug targets and who are exploring links between different
proteins in healthy and diseased tissues. The Swedish national node of ELIXIR
plays an important role in this action, working with EMBL-EBI. The
consolidated efforts make the Human Protein Atlas interoperable with
proteomic resources such as PRIDE, InterPro and the Gene Expression Atlas.

4.3 BILS - ProteomeXchange

    The arrival of tremendous volumes of biological data calls for
distributed data storage and replication, and for a reliable and scalable
data access interface. One of the ELIXIR pilot initiatives aims to integrate
the raw data repositories for mass spectrometry proteomics data run by the
Bioinformatics Infrastructure for Life Sciences (BILS, Sweden) and the
ProteomeXchange consortium via the PRIDE database, hosted at EMBL-EBI, UK.
The key element of the infrastructure is provided by the European
infrastructure EUDAT (http://www.eudat.eu/). The ProteomeXchange consortium
facilitates submission and standardizes dissemination practices for
proteomics data resources. The main goal of the consortium is to develop a
framework that allows standard data submission and dissemination pipelines
between the main proteomics repositories, such as PeptideAtlas, PRIDE and
MassIVE. The consortium encompassed 1963 proteomics datasets as of May 2015.
PRIDE, one of the key participants, stores MS-based proteomics data, such as
protein expression data, post-translational modifications, raw MS data and
technical metadata.
    BILS is a distributed national research infrastructure supported by the
Swedish Research Council; its bioinformatics network includes 6 nodes in
major Swedish universities. Proteios, a multi-user platform for the analysis
and management of proteomics data, was developed as an essential part of the
integrative initiatives of BILS.
    EUDAT is a pan-European project aiming to build and operate a global
collaborative data infrastructure for the preservation and exchange of
scientific data in various disciplines. Essential components of its software
ecosystem, such as B2SAFE and iRODS, ensure robust, safe and highly available
data access.
B2SAFE software is a key component of the ProteomeXchange data
infrastructure.
    The initiative could serve as an example of how various types of data
storage services can be engaged in ELIXIR, and could demonstrate the potential
of collaboration between research infrastructures and e-infrastructures to
better manage the data deluge.

4.4 Protein resources in drug discovery

    Important aspects of many genetic diseases are reflected in the
potentially different roles of proteins and pathways in diverse cell
lineages. Interoperability between databases that provide tissue-specificity
information and describe the expression of genes and proteins in multiple
tissues, at different stages of development and in different disease
conditions, is becoming critically important for modern approaches to drug
discovery. The heterogeneity of data representation in these expression
resources poses a challenge, as they often complement each other while
different providers follow different rules to annotate and provide the
information. The major goal of the ELIXIR pilot is to define and implement
standards and tools that facilitate access to and integration of the data for
the scientific community. The proteomics and expression resources in the
framework include:
• The Human Protein Atlas (HPA) [4], a database of protein expression
  profiles based on immunohistochemistry.
• The PRoteomics IDEntifications database (PRIDE) [9], a public data
  repository for protein and peptide identifications.
• The Gene Expression Atlas (GXA) [8], an enriched database of gene
  expression patterns.
    The project proposes the following integration strategies. First,
summaries of information from different databases, based on a single entry
point and a common format, will be created. This approach was successfully
introduced earlier by the EMBL-EBI search portal and involves an amalgamation
of service layers on top of a database, providing summary data in a standard
manner while the original resources do not change their data or schemas. The
non-intrusive approach ensures the independence of the original sources and
provides on-demand integration. The second approach was adopted by the
BioSapiens consortium and defines a common terminology and format to describe
the minimum information for specific data entries. It provides a common
language and standard data format to integrate and compare protein
annotations for 39 databases. This strategy requires agreement on controlled
vocabularies and changes that might affect data content and the annotation
process, and is therefore a more challenging task for the data providers.
    The Distributed Annotation System (DAS) was used as a communication
fabric to disseminate protein expression summary data and protein sequence
annotations, as GXA and PRIDE use DAS to provide expression data. In
collaboration with HPA, a new DAS service was created to provide expression
summaries. Collaboration with other resources, such as UniProt, PDB, Pfam,
InterPro, PRIDE and IntAct, continues, aiming to create a BioJS component to
standardize the visualization of protein features, which will be used to
represent related expression data such as antibody binding and protein
identifications.
    One of the major challenges for integrating expression information among
the listed sources is metadata annotation. Implementation of metadata
harmonisation is planned as the next step, based on the Experimental Factor
Ontology (EFO) as a reference system. HPA also proposes an XML solution,
which is more standardized and flexible than DAS and might be better suited
as a means of data exchange.

References

 [1] Bairoch A, Boeckmann B. The SWISS-PROT protein sequence data bank.
     Nucl. Ac. Res., v. 19, p. 2247-2249, 1991.
 [2] Blomberg N. ELIXIR: Data for life. 2014,
     https://www.elixir-europe.org/system/files/ELIXIR_2014_brochure_full.pdf
 [3] Burks C, et al. The GenBank nucleic acid sequence database. Comput.
     Appl. Biosci., v. 4, p. 225-233, 1985.
 [4] Colwill K; Renewable Protein Binder Working Group, Gräslund S. A roadmap
     to generate renewable protein binders to the human proteome. Nat
     Methods., v. 15, p. 551-558, 2011.
 [5] ELIXIR consortium. Scientific programme 2014-2018. Executive summary.
     2015,
     https://www.elixir-europe.org/system/files/ELIXIR-Executive-Summary-2015_Digital.pdf
 [6] Lander E. et al. Initial sequencing and analysis of the human genome.
     Nature, v. 409, p. 860-921, 2001.
 [7] Masys D. New directions in bioinformatics. J. of Res. Nat. Inst. Stand.
     and Techn., v. 94, p. 59-63, 1989.
 [8] Petryszak R, et al. Expression Atlas update--a database of gene and
     transcript expression from microarray- and sequencing-based functional
     genomics experiments. Nucleic Acids Res., v. 42, p. 926-932, 2014.
 [9] Reisinger F. et al. Introducing the PRIDE Archive RESTful web services.
     Nucleic Acids Res., pii: gkv382, 2015.
[10] Thornton J. The future of bioinformatics. Trends in Biotechn., v. 17,
     p. 30-31, 1998.
[11] Topol E. Individualized medicine from prewomb to tomb. Cell, v. 157,
     p. 241-253, 2014.