<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Data-Driven Genomic Computing: Making Sense of Signals from the Genome</article-title>
        <subtitle>(Extended Abstract)</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Proceedings of the XIX International Conference “Data Analytics and Management in Data Intensive Domains” (DAMDID/RCDL'2017)</institution>
          ,
          <addr-line>Moscow</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Genomic computing is facing a technological revolution. In this paper, we argue that the most important problem of genomic computing is tertiary data analysis, which is concerned with the integration of heterogeneous regions of the genome: regions carry important signals, and the creation of new biological or clinical knowledge requires integrating these signals into meaningful messages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Genomics is a relatively recent science. Historically,
the double helix model of DNA, due to Nobel laureates
James Watson and Francis Crick, was published in
Nature in April 1953; the first draft of the human
genome, produced as a result of the Human Genome
Project, was published in Nature in February 2001, and
the full sequence was completed and published in April
2003. The Human Genome Project, primarily funded by
the National Institutes of Health (NIH), was a
collective effort involving twenty universities and
research centers in the United States, the United
Kingdom, Japan, France, Germany, Canada, and China.</p>
      <p>In the last fifteen years, the technology for DNA
sequencing has made enormous strides. Figure 1 shows the
cost of DNA sequencing over that period; by
inspecting the curve, one can note a huge drop around
2008, with the introduction of Next Generation
Sequencing, a high-throughput, massively parallel
technology based on image capturing. The cost
of producing a complete human sequence dropped to
1,000 US$ in 2015 and is expected to fall below 100
US$ in 2018.</p>
      <p>
        Each sequencing run produces a mass of information
(raw data) in the form of “short reads” of genome
strings. Once stored, the raw data of a single genome
reach a typical size of 200 GB; it is expected [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] that
between 100 million and 2 billion human genomes will
be sequenced by 2025, thereby generating the biggest
“big data” problem for mankind.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 From strings to signals</title>
      <p>Technological development has also driven the
creation of new methods for extracting signals from the
genome, and this in turn is helping us better
understand the information that the genome carries.
Our concept of the genome has evolved: from
a long string of 3.2 billion base pairs,
encoding adenine (A), cytosine (C), guanine (G), and thymine
(T), to a living system producing signals, to be
integrated and interpreted.</p>
      <p>The most interesting signals can be classified as
follows. (a) Mutations tell us the specific positions or regions
of the genome where the code of an individual differs
from the expected code of the “reference” human being.
Mutations are associated both with genetic diseases, which
are inherited and occur at specific positions of
the genes, and with other diseases such as cancer, which
also arise during human life and correlate
with factors such as nutrition and pollution.</p>
      <p>(b) Gene expression tells us under which specific
conditions genes are active (i.e., they transcribe a
protein) or inactive. It turns out that the same gene may
be highly active under given conditions and inactive under
others.</p>
      <p>(c) Peaks of expression indicate specific positions
of the genome where there is an increase of short reads
due to a specific treatment of the DNA; these in turn
indicate specific biological events, such as the binding of a
protein to the DNA.</p>
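      <p>The three signal types above can be represented uniformly as genomic regions: intervals on a chromosome carrying signal-specific attributes. The following Python sketch is our own illustration of this idea (coordinates and attribute names are made up for the example; this is not the GMQL data model itself).</p>

```python
from dataclasses import dataclass, field

# A minimal sketch: every signal can be modeled as a genomic region,
# i.e., an interval on a chromosome plus signal-specific attributes.

@dataclass
class Region:
    chrom: str          # chromosome, e.g. "chr7"
    start: int          # start position (0-based)
    stop: int           # end position (exclusive)
    attrs: dict = field(default_factory=dict)  # signal-specific values

# (a) a mutation: a one-base region where the code differs from the reference
mutation = Region("chr7", 140_453_136, 140_453_137, {"alt": "T"})

# (b) gene expression: a gene-sized region with an activity level
expression = Region("chr7", 140_419_127, 140_624_564, {"fpkm": 12.3})

# (c) a peak: a short region with a read-count enrichment score
peak = Region("chr7", 140_453_000, 140_453_400, {"score": 57.0})
```

Representing all three signals with the same region abstraction is what makes their later integration (Section 3) expressible as generic operations on intervals.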
      <p>These signals can be observed by using a genome
browser, i.e., a viewer of the genome. All signals are
aligned to a reference genome (the standard sequence
characterizing human beings; this sequence is
constantly improved and republished by the scientific
community). The browser is opened on a window of a
given length (from a few bases to millions of bases), and
the signals are presented as tracks on the browser; each
track, in turn, shows the signal, either by its
intensity or just by its position. Figure 2
presents a window; the red, blue, and yellow tracks
describe gene expression, peaks of expression, and
mutations. The black line indicates the position of (four)
genes; this is known information, or “annotations”,
that can be included in the window. An interesting
biological question could be: “find genes which are
expressed, where there are three peaks (i.e., peaks
from three experiments, confirmed by all of them) and
with at least one mutation”. Such a question
would, in this specific example, extract the second gene.</p>
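      <p>The example question above amounts to an interval-overlap query over the tracks. The following Python sketch (with hypothetical toy data mimicking the layout of Figure 2, not actual GMQL syntax) selects genes that are expressed, covered by at least three peaks, and carrying at least one mutation.</p>

```python
def overlaps(a, b):
    """True when half-open intervals a=(a0,a1) and b=(b0,b1) intersect."""
    return a[0] < b[1] and b[0] < a[1]

def answer(genes, expressed, peaks, mutations):
    """Genes that are expressed, have >= 3 peaks, and >= 1 mutation."""
    hits = []
    for g in genes:
        is_expressed = any(overlaps(g, e) for e in expressed)
        n_peaks = sum(overlaps(g, p) for p in peaks)
        n_muts = sum(overlaps(g, m) for m in mutations)
        if is_expressed and n_peaks >= 3 and n_muts >= 1:
            hits.append(g)
    return hits

# Toy tracks: four genes, of which only the second satisfies all conditions.
genes     = [(0, 100), (200, 300), (400, 500), (600, 700)]
expressed = [(200, 300), (400, 500)]
peaks     = [(210, 220), (240, 250), (270, 280), (410, 420)]
mutations = [(215, 216), (605, 606)]
print(answer(genes, expressed, peaks, mutations))  # → [(200, 300)]
```

In the toy data, the second gene is expressed, overlaps three peaks, and carries one mutation, so it is the only one extracted, mirroring the browser example.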
    </sec>
    <sec id="sec-3">
      <title>3 Tertiary Data Analysis and GeCo</title>
      <p>Signals can be loaded on the browser only after
being produced as the result of long and complex
bioinformatics pipelines. In particular, the analysis of NGS
data is classified as primary, secondary, and tertiary (see
Figure 3): primary analysis is essentially responsible for
producing raw data; secondary analysis is responsible for
extracting (“calling”) the signal from raw data and aligning
the signals to the reference genome; and tertiary
analysis is responsible for a number of tasks, all concerned
with data integration.</p>
      <p>The bioinformatics community has produced a huge
number of tools for secondary analysis. So far, it has
not been equally engaged in tertiary data analysis,
which is clearly the most important aspect of future
research.</p>
      <p>
        GeCo is developed by our group at Politecnico di
Milano as the outcome of an ERC Advanced Grant. GeCo
is based on GMQL, a high-level language for genomic
data management, and has a system architecture based
on cloud computing, implemented on engines such as
Spark and Flink [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. SciDB, a scientific database
produced by the spinoff company Paradigm4, supports a
genomics extension focused on genomic data integration
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. DeepBlue provides easy access to datasets
produced within the BLUEPRINT consortium, with a language
quite similar to GMQL [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. FireCloud,
developed by the Broad Institute, offers an integrated
platform supporting cancer research [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. All these systems
already support access to a huge number of open
datasets, including ENCODE and TCGA.
      </p>
      <p>This two-page abstract has set the stage for discussing
why GeCo is an important project in the context of
tertiary data analysis for genomics. In the full paper, we
will describe some aspects of the GeCo project,
focusing on the GeCo API (not presented so
far). This is an important aspect of the project, as it
guarantees usability of the system from multiple user
and language interfaces, thereby enabling
interoperability.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Stephens</surname>
            ,
            <given-names>Z. D.</given-names>
          </string-name>
          et al.:
          <article-title>Big Data: Astronomical or Genomical?</article-title>
          <source>PLoS Biol.</source>
          <volume>13</volume>
          (
          <issue>7</issue>
          ) (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Kaitoua</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinoli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertoni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Framework for Supporting Genomic Operations</article-title>
          , IEEE Transactions on Computers, doi:10.1109/TC.2016.2603980 (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Masseroli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          et al.:
          <article-title>GenoMetric Query Language: A novel approach to large-scale genomic data management</article-title>
          .
          <source>Bioinformatics</source>
          <volume>31</volume>
          (
          <issue>12</issue>
          ),
          doi:10.1093/bioinformatics/btv048 (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          Anonymous:
          <article-title>Accelerating Bioinformatics Research with New Software for Big Data to Knowledge (BD2K)</article-title>
          , Paradigm4 Inc
          . (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Albrecht</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          et al.:
          <article-title>DeepBlue Epigenomic Data Server: Programmatic Data Retrieval and Analysis of the Epigenome</article-title>
          ,
          <source>Nucleic Acids Research</source>
          ,
          <volume>44</volume>
          /W1 (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] https://software.broadinstitute.org/firecloud</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>