<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Infrastructure for Disease Prevention and Precision Medicine</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marco Moscatelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Manconi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Gnocchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luciano Milanesi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Biomedical Technologies, National Research Council</institution>
          ,
          <addr-line>Segrate (Mi)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Precision medicine is an emerging approach to both disease treatment and prevention. It makes it possible to classify individuals into subpopulations that differ in their susceptibility to a particular disease, with the aim of tailoring medical treatment to the individual characteristics of each patient. To provide precision medicine to patients, researchers need to analyze huge amounts of heterogeneous data from both biomedical research and healthcare systems. The growing amount of these data gives rise to the need for new research methods and analysis techniques. In this paper we present an infrastructure that exploits new strategies aimed at efficiently storing, accessing, and analyzing these heterogeneous data.</p>
      </abstract>
      <kwd-group>
        <kwd>Big Data</kwd>
        <kwd>Disease Prevention</kwd>
        <kwd>Precision Medicine</kwd>
        <kwd>HPC</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Motivation</title>
      <p>
        Advances in technology have produced a huge amount of data in both
biomedical research and healthcare systems. This growing amount of data gives
rise to the need for new research methods and analysis techniques. Analysis of
these data offers new opportunities to define novel diagnostic processes.
Therefore, a greater integration between healthcare and biomedical data is essential
to devise novel predictive models in the field of biomedical diagnosis. In this
context, the digitalization of clinical exams and medical records is becoming
essential to collect heterogeneous information. Analysis of these data by means of
big data technologies will allow a more in-depth understanding of the
mechanisms leading to diseases, and at the same time it will facilitate the development of
novel diagnostics and personalized therapeutics. The recent application of big
data technologies in the medical field offers new opportunities to integrate
enormous amounts of medical and clinical information from population studies.
Therefore, it is essential to devise new strategies aimed at storing, accessing, and
analyzing the data in a standardized way. Moreover, it is important to provide
suitable methods to manage these heterogeneous data.
Relational databases do not lend themselves to the nature of these data. In our opinion,
better results can be obtained using non-relational (NoSQL) databases. Starting
from these considerations, the infrastructure has been developed on a NoSQL
database with the aim of combining scalability and flexibility. In
particular, MongoDB [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has been used, as it is better suited to managing different types
of data on a large scale. In doing so, the infrastructure is able to provide an
optimized management of huge amounts of heterogeneous data, while ensuring high
speed of analysis.
      </p>
      <p>To enable researchers to perform their analyses on
dedicated computing resources, the infrastructure has been built on a hardware
platform intended for big-data classes of applications, which consists of:
        <list list-type="bullet">
          <list-item><p>a massive storage platform of 1.6 PB;</p></list-item>
          <list-item><p>2040 CPU cores;</p></list-item>
          <list-item><p>16 NVIDIA K20 GPUs;</p></list-item>
          <list-item><p>2 big memory nodes (i.e., 1 node equipped with 1 TB and 1 node equipped
with 512 GB of memory).</p></list-item>
        </list>
      </p>
      <p>
        Precision medicine is about the
customization of healthcare, with decisions and practices tailored to an
individual patient based on their genome sequence, microbiome composition, lifestyle,
and diet, in addition to medical and clinical data. Therefore, in addition to the
data about the patient collected by healthcare providers, researchers need to
incorporate many different types of data. To this end, a web-based platform
built upon the Galaxy technology [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] has also been implemented to enable
researchers to retrieve biological data and include them in their analyses. Galaxy is an
open web-based scientific workflow system for data-intensive biomedical research,
accessible to researchers who do not have programming experience. By default,
Galaxy is designed to run jobs on local systems. However, it can also be
configured to run jobs on a cluster: the front-end Galaxy application runs on a single
server, while tools are run on cluster nodes. To this end, Galaxy supports
different distributed resource managers, so that it can be deployed on different clusters.
For this specific case, in our opinion SLURM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] represents the most suitable
workload manager to manage and control jobs on the above hardware
infrastructure. SLURM is a highly configurable workload and resource manager and,
at the time of writing, it is used on six of the ten most powerful computers in the world.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>
        The presented infrastructure exploits big data technologies in order to overcome
the limitations of relational databases when working with large and
heterogeneous data. The infrastructure implements a set of interface procedures aimed
at preparing the metadata for importing data into a NoSQL database. Moreover, data
can also be represented as a graph using Neo4j [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The Neo4j database makes it possible
to emphasize and enhance the connections between the data and facilitates the
retrieval and navigation of data. Experimental tests on huge amounts of data show
that our infrastructure achieves a level of speed and scalability
unachievable with relational databases. This performance is mainly due
to the ability of the infrastructure to index any type of field as well as to customize
the queries. In particular, the high flexibility in customizing the queries increases
the search performance and the specificity of the results.
      </p>
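      <p>As an illustration of why indexing arbitrary fields helps with heterogeneous records, the following minimal Python sketch builds a lookup index over a nested field that only some records contain. It is not the code of the infrastructure and it does not use the MongoDB API: the records, the field names, and the helper functions get_field and build_index are hypothetical and serve only to show the idea.</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict


def get_field(doc, path):
    """Follow a dotted path such as 'genome.gene'; return None if it is absent."""
    value = doc
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def build_index(docs, path):
    """Map each value found at `path` to the positions of the records holding it."""
    index = defaultdict(list)
    for pos, doc in enumerate(docs):
        value = get_field(doc, path)
        if value is not None:  # records without the field are simply skipped
            index[value].append(pos)
    return index


# Hypothetical records with different shapes, as a schema-less store allows.
records = [
    {"patient": "p1", "clinical": {"diagnosis": "melanoma"}},
    {"patient": "p2", "genome": {"gene": "BRCA1"}, "clinical": {"diagnosis": "breast cancer"}},
    {"patient": "p3", "lifestyle": {"smoker": True}},
    {"patient": "p4", "genome": {"gene": "BRCA1"}},
]

# Index a nested field once, then answer queries without scanning every record.
gene_index = build_index(records, "genome.gene")
matches = [records[i]["patient"] for i in gene_index["BRCA1"]]
print(matches)
```
      </preformat>
      <p>In this sketch the query touches only the records listed in the index entry, and records that lack the indexed field impose no schema change. A document database offers the same kind of per-field indexing as a built-in feature.</p>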
      <p>Moreover, the robust hardware infrastructure, together with the Galaxy
web-based platform, makes it easy to integrate and analyze heterogeneous data from
different biological sources.</p>
      <p>Currently, the infrastructure is used in a project aimed at implementing
techniques to infer the predisposition to certain cancers. The project is
funded by the Fondazione Bracco.</p>
    </sec>
    <sec id="sec-3">
      <title>Supplementary Information</title>
      <p>This work has been supported by the Fondazione Bracco and by the Italian Ministry
of Education and Research Flagship (PB05) InterOmics project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. http://www.mongodb.org</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Goecks</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nekrutenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , The Galaxy Team:
          <article-title>Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences</article-title>
          .
          <source>Genome Biol</source>
          .
          <year>2010</year>
          Aug 25;
          <volume>11</volume>
          (
          <issue>8</issue>
          ):
          <fpage>R86</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. http://slurm.schedmd.com/</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. http://neo4j.com/</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>