<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Facing the Challenges of Genome Information Systems: a Variation Analysis Prototype.</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ana M. Martinez</string-name>
          <email>amartinez@pros.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ainoha Martín</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria José Villanueva</string-name>
          <email>mvillanueva@pros.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francisco Valverde</string-name>
          <email>fvalverde@pros.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ana M. Levín</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Pastor</string-name>
          <email>opastor@pros.upv.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centro de Investigación en Métodos de Producción de Software, Universidad Politécnica de Valencia, Camino de Vera S/N, 46022</institution>
          ,
          <addr-line>Valencia</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In Bioinformatics there is a lack of software tools that fit the requirements demanded by biologists. For instance, when a DNA sample is sequenced, a lot of work must be performed manually and several tools are used. The application of Information Systems (IS) principles to the development of bioinformatic tools opens a new and interesting research path. One of the most promising approaches is the use of conceptual models to precisely define how genomic data is represented in an IS. This work introduces how to build a Genome Information System (GIS) using these principles. As a first step towards this goal, a conceptual model to formally describe genomic mutations is presented. In addition, as a proof of concept of this approach, a variation analysis prototype has been implemented using this conceptual model as its development core.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Thanks to the breakthrough of the Human Genome Project and the advances
in DNA sequencing, an enormous amount of genetic data is being produced by
researchers every day. Most of these experiments focus on understanding the
relationship between genotype (the gene configuration and combination of a
particular individual) and phenotype (the expression of the genes in a specific
human feature). As a consequence, the number of biological databases and tools
created to exploit the produced data has grown drastically. However, these tools
and databases have usually been defined to support a specific research area or
experiment. Therefore, when biologists want to use them for a particular assay, it
is very unlikely that they support the specific requirements of that assay. This
issue leads to a situation where the researcher has to spend a lot of time and
effort to perform a simple analysis. Since these bioinformatics tools are not
developed using IS principles, they are not aligned with user requirements. The
main consequences of this issue are:</p>
      <list list-type="bullet">
        <list-item>
          <p>Some biological databases are only human-readable and thus cannot be
processed properly in an automatic way.</p>
        </list-item>
        <list-item>
          <p>The extraction of relevant data is difficult because it is spread across
different databases.</p>
        </list-item>
        <list-item>
          <p>Since several tools are required to analyze the data, the specification of the
tooling workflow and its integration is far from trivial.</p>
        </list-item>
        <list-item>
          <p>Including new studies and bibliography in the available tools becomes
a hard task.</p>
        </list-item>
      </list>
      <p>
        To face these issues, some researchers have proposed [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] the
development of Genome Information Systems (GIS), that is, IS specifically designed
to handle large amounts of genomic data. In this work, a new approach to
developing GIS is proposed: the use of conceptual models to organize genomic data
and guide the development. Thanks to the close collaboration with biologists in
the context of this research project, the gap between the disciplines of Software
Engineering and Genetics is bridged. The result of this interdisciplinary
collaboration is a conceptual model that guides the alignment of concepts between
both fields. As a result, the design and implementation of the software artifacts
that make up a GIS become an easier process.
      </p>
      <p>Following that idea, this paper presents a GIS prototype that analyzes a DNA
sequence in order to find documented variations for a specific gene. Once all
variations are located in the sequence, the prototype splits them into two groups:
one group contains harmless variations and the other contains variations
that produce a change in gene or protein function. For those in the latter group,
their specific phenotype is reported as it has been described in the literature.</p>
      <p>This information is bibliographically referenced and gathered in a report that
helps the researcher to understand the genetic meaning of the variation and why
it produces a certain phenotype. This is very useful because it can speed up
the diagnosis of a specific disease. Furthermore, it is widely accepted that
early disease detection can be decisive. The main contribution of this
work is that the GIS development is supported by a set of conceptual model
entities that formalize the domain concepts related to genomic variations.
As a consequence, the conceptual model plays an integration role: it provides
the genomic knowledge in an unambiguous way that is independent of specific
data source details.</p>
      <p>With that goal in mind, the rest of the paper is organized as follows.
Section 2 reviews DNA variation analysis tools. Section 3 details
a conceptual model for describing genomic variations. Section 4 describes how the
variation analysis prototype has been developed. Finally, Section 5 states
conclusions and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In recent years, several commercial tools have been developed to provide genomic
analysis. These tools can perform tests in order to estimate a customer's
probability of suffering from certain diseases. Navigenics [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], 23andMe [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and deCODEme [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
are the most relevant tools in this field. The differences between them are briefly
summarized in Table 1. However, the accuracy of these tools is far from ideal.
Results are not reported in an unambiguous way because biological concepts
are not precisely defined. Without a conceptual model that guides the precise
definition of the domain, further integration with external tools is complex to
achieve.
      </p>
      <p>Another drawback of these tools is that the only variations reported are
SNPs (Single Nucleotide Polymorphisms). The conceptual model improves
report quality because other complex variations, such as repetitive insertions or
deletions, are also classified. Furthermore, the diseases detected by these commercial
tools are constrained by the number of supported genes. The use of a conceptual
model overcomes this constraint because it provides guidelines to support several
gene reference sequences and their newly discovered variations.</p>
    </sec>
    <sec id="sec-3">
      <title>A Conceptual Model for Describing Variations</title>
      <p>The main objective of the conceptual model presented in this paper is to
establish a connection point between the genomic field and the GIS development
domain. One of the main characteristics of the genomic field is heterogeneity.
The unification of the relevant concepts is a difficult task, since genomic
concepts are not precisely defined. Moreover, knowledge in the field is still
developing and these concepts are constantly evolving, which makes the organization
of all the available genetic data more difficult.</p>
      <p>
        Genetic databases are thus affected by this heterogeneity problem. Each
database reflects the concepts according to the interpretation and terminology
of a biologist. However, there are different definitions for the same concept; for
example, a variation in the DNA sequence is referred to by the terms variation,
mutation, polymorphism or SNP [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Even though all of them represent more or
less the same concept, there are slight differences among them. The problem of
heterogeneous data can be solved with the use of conceptual models, as some
works have proposed [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The development of a conceptual model to represent
the human genome is a useful approach to understanding this complex domain, since
concepts are precisely defined and related to one another. If new concepts, relations
or changes are discovered, they can be easily incorporated into the model.
      </p>
      <p>
        The conceptual model presented here aims to be precise with respect to both
genetic concepts and IS principles, because it has been developed by software
engineers and biologists specialized in the genomic field. The model presented in
this section focuses on the description of genomic variations. However, it is an
excerpt of a wider one [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], whose main goal is the specification of the human
genome concepts required for developing GIS.
      </p>
      <p>
        Allelic Reference Type models the reference sequence that defines a
"universal" gene to be used for comparison purposes. These reference sequences are
extracted from trusted data sources such as the RefSeqGene database [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Allelic Variant
represents a DNA sequence of an individual which has several variations with
respect to the allelic reference.
      </p>
      <p>Each variation discovered by the comparison process performed over
a sequence is modeled by the Variation entity (2). The Variation entity stores
all the variations documented in the genetic literature that are associated with some
disease or with normal changes due to the intrinsic nature of an individual. This
entity has two specializations: Precise variations, which define a variation that
is completely located, and Imprecise variations, whose location details are not
specified. Precise variations are further categorized into four entities according to
the change performed in the sequence: Insertion, Deletion, Indel (insertion/deletion)
and Inversion. An indel can also be categorized as an SNP when it occurs in at
least 1% of the population.</p>
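      <p>The specialization hierarchy described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions and not the authors' implementation: the field names (variation_id, position, inserted, length, deleted_length, population_frequency) and the 1-based coordinate convention are hypothetical, and the is_snp rule simply restates the 1% criterion given in the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variation:
    """A documented difference with respect to the allelic reference."""
    variation_id: str  # hypothetical identifier, not named in the text


@dataclass(frozen=True)
class ImpreciseVariation(Variation):
    """A variation whose location details are not specified."""
    description: str  # hypothetical free-text description


@dataclass(frozen=True)
class PreciseVariation(Variation):
    """A variation that is completely located on the reference."""
    position: int  # assumed: 1-based coordinate on the allelic reference


@dataclass(frozen=True)
class Insertion(PreciseVariation):
    inserted: str  # nucleotides added at `position`


@dataclass(frozen=True)
class Deletion(PreciseVariation):
    length: int  # number of nucleotides removed


@dataclass(frozen=True)
class Indel(PreciseVariation):
    deleted_length: int
    inserted: str
    population_frequency: Optional[float] = None  # fraction in [0, 1], if known

    def is_snp(self) -> bool:
        # Rule as stated in the text: occurs in at least 1% of the population.
        return (self.population_frequency is not None
                and self.population_frequency >= 0.01)


@dataclass(frozen=True)
class Inversion(PreciseVariation):
    length: int  # number of nucleotides inverted
```
]]></preformat>
      <p>Modeling the four precise categories as subclasses keeps the location attribute in one place, so any code that needs a position can accept a PreciseVariation regardless of the kind of sequence change.</p>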
      <p>A variation that is specified in the model is always related to its phenotype,
which is modeled by the Phenotype entity (3). The Certainty entity specifies
the probability that a phenotype shows up because of a concrete variation
in the genotype. When a genotype-phenotype association is identified, it is
essential to know the bibliographic reference and the original
database where the discovery was stated. This data is defined by the
Bibliography Reference and BibliographyDB entities (4), respectively. As a first
result of this conceptual model, a genetic database (GDB) has been created to store
the variation information that is used by the presented GIS prototype.</p>
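      <p>As an illustration of how these entities relate, the following Python sketch links a variation to its phenotypes through a Certainty record that carries a probability and a bibliographic reference. The attribute names, the pathogenic flag and the helper phenotypes_for are assumptions made for this example; the paper does not specify them.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BibliographyDB:
    """The database where a discovery was originally stated."""
    name: str


@dataclass(frozen=True)
class BibliographyReference:
    """The publication that supports a genotype-phenotype association."""
    title: str
    source_db: BibliographyDB


@dataclass(frozen=True)
class Phenotype:
    name: str
    pathogenic: bool  # hypothetical flag: does the phenotype describe a disease?


@dataclass(frozen=True)
class Certainty:
    """Probability that a phenotype shows up because of a given variation."""
    variation_id: str  # hypothetical key of the associated variation
    phenotype: Phenotype
    probability: float
    reference: BibliographyReference

    def __post_init__(self) -> None:
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must lie in [0, 1]")


def phenotypes_for(variation_id: str,
                   associations: List[Certainty]) -> List[Certainty]:
    """Return the associations of one variation, most probable first."""
    matches = [a for a in associations if a.variation_id == variation_id]
    return sorted(matches, key=lambda a: a.probability, reverse=True)
```
]]></preformat>
      <p>Keeping the reference on the association rather than on the phenotype reflects the requirement stated above: every identified genotype-phenotype link must be traceable to the publication and database that reported it.</p>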
    </sec>
    <sec id="sec-4">
      <title>A GIS Proof of Concept: a Variation Analysis Prototype</title>
      <p>
        The main goal of the prototype is to show how conceptual models can be useful
to define a GIS. One of the most common tasks in the genomic area is the
analysis of genomic sequences [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Researchers perform the analysis by
comparing a DNA sample of a particular gene with its
reference sequence. The comparison is done using an alignment tool that shows a list
of differences between them. After that, an experienced researcher has to decide
which variations are relevant and which are not. Then, the researcher has to dive
into the vast and unstructured amount of information that is scattered across the Web
and search for the bibliography that justifies each relevant variation. Performing
this work manually is a tedious and time-consuming task.
      </p>
      <p>The proposed prototype reduces this time by automating the major part of
the manual work. This automation is possible thanks to the conceptualization
of the domain by the presented conceptual model. Data such as genes, variations,
phenotypes and bibliographic references are now represented as precisely defined
conceptual entities. Thanks to this conceptualization, heterogeneity and data
dispersion problems are solved, which avoids the manual preprocessing of data
that is not machine-readable and ensures the quality of the data stored.</p>
      <p>The purpose of the presented GIS prototype is to receive a DNA sample
from a patient and provide a report that helps the doctor to diagnose a certain
disease. The experts only have to enter the sample in a suitable format
and review the provided results, without any manual treatment of the data
or endless searches across the bibliography.</p>
      <p>The analysis process performed by the prototype is summarized in Figure 2.
Some conceptual model entities that are used in the different steps are depicted
in white rectangular boxes. The process is divided into five main steps:</p>
      <list list-type="order">
        <list-item>
          <p>Input data: The biologist selects a gene from the set supported by the
prototype, for instance the BRCA1 gene, and enters the DNA sample to
be analyzed. The sample can be entered manually or by
uploading a file in FASTA format.</p>
        </list-item>
        <list-item>
          <p>
            Alignment report: According to the selected gene, the prototype locates the
suitable reference using the allelic reference entity. After that, the sample is
aligned against the reference in order to find
variations. This alignment is performed using the BLAST algorithm [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]; importing results from DNA sequencing tools such as Sequencher [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ] will be
supported in future versions. Using the defined conceptual model, each discovered
difference is formalized as an instance of the variation entity. This
formalization, which is not currently present in other tools or databases, is
independent of the output of any alignment tool and provides a suitable
way of exchanging variations. A report that summarizes all the changes is
generated using these variation entities.
          </p>
        </list-item>
        <list-item>
          <p>Variation knowledge: Thanks to the report generated in the previous phase,
the classification problem is simplified. Variations are located with respect to
a well-known reference sequence, so their positions match the genomic
data stored in the GDB. Each variation is then queried in the GDB to
determine whether it has been defined as a precise variation. If a variation cannot
be found in the GDB, it is classified as unknown. At this point, known variations
are classified into a specific type of sequence change. Unknown variations
are classified as non-silent if the variation produces a change, in other words,
an effect on the expected gene product (protein).</p>
        </list-item>
        <list-item>
          <p>Phenotype assessment: Variations classified as known may have an
associated phenotype. In order to assess whether the phenotype is related to a specific
disease, a research publication is required to provide trustworthy evidence. For
those cases, the conceptual model describes the bibliographic reference that
supports the phenotype for a specific variation. In the context of this work,
variations with a pathogenic phenotype are classified as mutations, whereas
they are classified as SNPs if no negative phenotype is described.</p>
        </list-item>
        <list-item>
          <p>Report creation: All the obtained information is gathered in a report. This
report contains information about the variations found: mutations, variations
whose phenotype is not a disease, and unknown variations. Each variation is
reported with the following information: the location where it was found in
the sequence, its type (Insertion, Deletion, Indel or Inversion) and the
number of nucleotides inserted or deleted. For the mutations found in the GDB,
their associated phenotype and its bibliography are added as well. Finally, the
report file can be saved as a text document.</p>
        </list-item>
      </list>
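      <p>The five steps can be sketched end to end in Python as follows. This is a simplified illustration and not the prototype's code: it assumes that the sample and the reference have already been aligned (the prototype uses BLAST for that step) and are given as two gapped strings of equal length, it models the GDB as an in-memory dictionary keyed by (position, reference bases, sample bases), and it does not detect inversions. All function names and the record layout are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Tuple

Key = Tuple[int, str, str]  # (reference position, reference bases, sample bases)


def read_fasta(text: str) -> str:
    """Step 1: return the sequence of the first record in FASTA text."""
    parts: List[str] = []
    seen_header = False
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts here; stop reading
            seen_header = True
        elif line:
            parts.append(line)
    return "".join(parts).upper()


def find_variations(ref_aln: str, sample_aln: str) -> List[Key]:
    """Step 2: group consecutive differing alignment columns into variations.

    Both strings come from a pairwise alignment and use '-' for gaps. The
    position is the 1-based reference coordinate of the first affected base,
    or of the base preceding the gap when the variation starts with an insertion.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have the same length")
    found: List[Key] = []
    ref_pos, start = 0, 0
    ref_run: List[str] = []
    alt_run: List[str] = []
    for r, s in zip(ref_aln, sample_aln):
        if r != "-":
            ref_pos += 1
        if r != s:
            if not ref_run and not alt_run:
                start = ref_pos  # first column of a new run of differences
            if r != "-":
                ref_run.append(r)
            if s != "-":
                alt_run.append(s)
        elif ref_run or alt_run:
            found.append((start, "".join(ref_run), "".join(alt_run)))
            ref_run, alt_run = [], []
    if ref_run or alt_run:
        found.append((start, "".join(ref_run), "".join(alt_run)))
    return found


def variation_type(ref: str, alt: str) -> str:
    """Name the sequence change (inversions are not detected in this sketch)."""
    if not ref:
        return "Insertion"
    if not alt:
        return "Deletion"
    return "Indel"


def build_report(variations: List[Key], gdb: Dict[Key, dict]) -> List[dict]:
    """Steps 3 to 5: look each variation up in the GDB and label it."""
    report = []
    for pos, ref, alt in variations:
        entry = gdb.get((pos, ref, alt))
        if entry is None:
            label = "unknown"
        elif entry["pathogenic"]:
            label = "mutation"
        else:
            label = "SNP"
        report.append({
            "position": pos,
            "type": variation_type(ref, alt),
            "length": max(len(ref), len(alt)),
            "label": label,
            "phenotype": entry["phenotype"] if entry else None,
            "reference": entry["reference"] if entry else None,
        })
    return report
```
]]></preformat>
      <p>Because the variations are expressed in reference coordinates rather than in the output format of a particular alignment tool, the same lookup and report code could be reused if alignments were imported from another tool.</p>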
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>This work proposes a GIS engineering solution to address the problems of
heterogeneity in the genomic domain. A conceptual model is presented which
formally describes and defines the concepts related to genomic variations. As a
proof of concept, a GIS prototype that uses this conceptual model as its
foundation has been implemented.</p>
      <p>One of the advantages of using the presented GIS prototype is that the
variation analysis can be performed using only one tool, which avoids a data
workflow spread across several tools. In addition, using a conceptual model to guide
the development simplifies the acquisition of the genetic data and allows this data
to be precisely linked to the bibliography.</p>
      <p>However, the performance of the prototype when working with real DNA
samples remains to be analyzed. In order to fulfill this task, further studies related
to sequencing algorithms and tools will be carried out.</p>
      <p>
        Conceptual modeling of genes is not a completely novel research area. Some
works [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] that organize genomic data have been proposed
before. The main contribution of the presented work is that the conceptual model
proposed here is specifically designed to guide the implementation of software
artifacts using a model-driven development approach.
      </p>
      <p>As further work, it is planned to extend the GIS prototype with the aim of
achieving higher accuracy and facilitating the input of sequences. As a final
goal, the GIS prototype will be tested in a real environment by means of a
collaboration with IMEGEN, a genomic medicine institute, and a couple of local
hospitals.</p>
      <p>Acknowledgments. This research work has been developed with the support of the
Generalitat Valenciana under the project ORCA (PROMETEO/2009/015).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          :
          <article-title>Eugenes: a eukaryote genome information system</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>30</volume>
          (
          <year>2002</year>
          )
          <fpage>145</fpage>–<lpage>148</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Navigenics. http://www.navigenics.com (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. 23andMe. https://www.23andme.com (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. deCODEme. http://www.decodeme.com (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Irizarry</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bolstad</surname>
            ,
            <given-names>B.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Collin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cope</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hobbs</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Speed</surname>
            ,
            <given-names>T.P.</given-names>
          </string-name>
          :
          <article-title>Summaries of Affymetrix GeneChip probe level data</article-title>
          .
          <source>Nucleic Acids Research</source>
          <volume>31</volume>
          (
          <year>2003</year>
          ) e15
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Klein</surname>
          </string-name>
          , R.:
          <article-title>Power analysis for genome-wide association studies</article-title>
          .
          <source>BMC Genetics</source>
          <volume>8</volume>
          (
          <year>2007</year>
          )
          <fpage>58</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. den Dunnen, J.T.,
          <string-name>
            <surname>Antonarakis</surname>
          </string-name>
          , E.:
          <article-title>Nomenclature for the description of human sequence variations</article-title>
          .
          <source>Human Genetics</source>
          <volume>109</volume>
          (
          <year>2001</year>
          )
          <fpage>121</fpage>–<lpage>124</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Richesson</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Turley</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>Conceptual models: Definitions, construction, and applications in public health surveillance</article-title>
          .
          <source>Journal of Urban Health</source>
          <volume>80</volume>
          (
          <year>2003</year>
          ) i128
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Pastor</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levin</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Casamayor</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Celma</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villanueva</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eraso</surname>
            ,
            <given-names>L.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alonso</surname>
            ,
            <given-names>M.P.</given-names>
          </string-name>
          <article-title>Enforcing conceptual modeling to improve the understanding of human genome</article-title>
          .
          <source>Research Challenges in Information Science (RCIS</source>
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. NCBI:
          <article-title>The RefSeqGene project</article-title>
          . http://www.ncbi.nlm.nih.gov/RefSeq/RSG (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Stevens</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baker</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brass</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A classification of tasks in bioinformatics</article-title>
          .
          <source>Bioinformatics</source>
          <volume>17</volume>
          (
          <year>2001</year>
          )
          <fpage>180</fpage>–<lpage>188</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Altschul</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gish</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Myers</surname>
            ,
            <given-names>E.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Basic local alignment search tool</article-title>
          .
          <source>Journal of Molecular Biology</source>
          <volume>215</volume>
          (
          <year>1990</year>
          )
          <fpage>403</fpage>–<lpage>410</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Gene Codes Corporation.: Sequencher. http://www.genecodes.com (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Consortium</surname>
            ,
            <given-names>T.G.O.</given-names>
          </string-name>
          :
          <article-title>Gene ontology: tool for the unification of biology</article-title>
          .
          <source>Nature Genetics</source>
          <volume>25</volume>
          (
          <year>2000</year>
          )
          <fpage>25</fpage>–<lpage>29</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Paton</surname>
            ,
            <given-names>N.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moussouni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brass</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eilbeck</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goble</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hubbard</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliver</surname>
            ,
            <given-names>S.G.</given-names>
          </string-name>
          :
          <article-title>Conceptual modelling of genomic information</article-title>
          .
          <source>Bioinformatics</source>
          <volume>16</volume>
          (
          <year>2000</year>
          )
          <fpage>548</fpage>–<lpage>557</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Ram</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Toward Semantic Interoperability of Heterogeneous Biological Data Sources</article-title>
          .
          <source>In: Advanced Information Systems Engineering</source>
          . Springer Berlin / Heidelberg (
          <year>2005</year>
          )
          <fpage>32</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>