<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using Metadata for Locating Genomic Datasets on a Global Scale</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anna Bernasconi</string-name>
          <email>anna.bernasconi@polimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dipartimento di Elettronica</institution>, Milan, Italy
        </aff>
      </contrib-group>
      <abstract>
        <p>Genomic research has benefited from recent extraordinary improvements in DNA sequencing techniques, which have led to the production of enormous numbers of datasets storing information such as nucleotide sequences, gene locations and expression levels, and protein-DNA interactions. As this has now become a big data matter, characterized by an underlying disorganization, there is a strong need for integrative solutions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>In this paper, we devote our efforts to the
management of genomic data, to be
organized and located using the descriptions of
experimental studies. Such documentation, also
referred to as metadata, contains
fundamental information to understand the content of
experimental samples (namely, how the
biological material was extracted and processed,
in which clinical conditions, and with which
techniques). We propose a novel framework to
manage metadata of genomic datasets, offering
a unified view with respect to a number
of heterogeneous data sources (usually big
international consortia, but also small research
centers) that currently display their metadata
in disorganized and very cumbersome formats.
The final outcome of this work is a search
platform which allows easy location of
relevant sources for specific genomic data analysis
problems.</p>
      <p>Copyright © CIKM 2018 for the individual papers by the papers'
authors. Copyright © CIKM 2018 for the volume as a collection
by its editors. This volume and its papers are published under
the Creative Commons License Attribution 4.0 International (CC
BY 4.0).</p>
      <p>[Figure 1. Region data on chromosome 1, with schema ((chr, start, stop, strand), (proteinID, alignID, type)) and rows such as ((chr1, 1, 16, +), ('uc001aaa.3', 'uc001aaa.3', 'cds')), ((chr1, 68, 94, +), ('uc001aaa.3', 'uc001aaa.3', 'exon')), and ((chr1, 137, 145, +), ('uc001aaa.3', 'uc001aaa.3', 'exon')). Metadata as attribute-value pairs: biosample_type: cell line; biosample_term_name: MCF-7; biosample_tissue: breast; assay: ChIP-seq; donor.organism.name: Homo Sapiens.]</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>Genomic research is blooming because of
revolutionary technologies to sequence DNA (Next Generation
Sequencing), which operate at much faster rates and
lower costs than traditional techniques. Such speed-up
is achieved by means of massively parallel sequencing,
which enables millions of nucleic acid fragments to
be handled simultaneously. A single human genome,
about 3 billion units of DNA in 23 thousand genes,
can now be processed in just a single day and stored
in around 200 Gigabytes [CCK+17].</p>
      <p>Because of thousands of new experimental datasets
becoming available every day, genomics has become
a new "big data" generator (see [SLF+15] for
comparison with other major big data domains). To
boost further research, this wealth of data needs to
be made available for search and download.
Currently, it is distributed across a range of
worldwide repositories (nearly 1,000 sequencing centers
in 55 countries in universities, hospitals, and other
research laboratories), usually coordinated by
national research consortia and institutes.
Organizations such as the International Cancer Genome
Consortium (ICGC, [ZBC+11]), the National Cancer
Institute Genomic Data Commons (GDC, [JFGS17]),
the National Center for Biotechnology Information
(NCBI, [Coo17]), the National Human Genome
Research Institute (NHGRI, [Man16]), and the
European Bioinformatics Institute (EBI, [LVAL07])
maintain and enrich the repositories of genomic data, which
may contain both open and controlled data (i.e., only
accessible upon approval from a Data Access
Committee). Public data are beneficial to researchers and
clinicians, who can access and compare them, as well
as search for common patterns across a large number
of individuals.</p>
      <p>While data sources have more or less agreed on the
definition of protocols and formats for data production
and transformation, no convergence has been observed
for common metadata formats. The existing
repositories propose their own standards, which are only used
internally, presume a thorough knowledge of their
specific rules, and require tedious manual work before
data from different sources can be combined.</p>
      <p>Considering only publicly available data, we focused
on the need of the genomic research community for
a tool which helps to locate and retrieve interesting
data to solve biological and clinical questions and also
favours data interoperability. We propose a metadata
storage system, specific for genomic datasets, with a
four-fold contribution: 1. gradual inclusion of all
processed datasets from sources considered interesting for
tertiary analysis (i.e., data analysis in charge of
"making sense" of genomic signals [Gab10]); 2.
integration of genomic data residing in these heterogeneous
data sources to provide a unified view of the
comparable concepts; 3. curated representations of metadata,
kept coherent with the current status of the
original sources; 4. user-friendly search functionality,
based on keywords characterizing the samples, but
also on their synonyms and hypernyms, which are
retrieved through specialized ontologies.</p>
      <p>Fig. 1 illustrates the story behind our effort. The
genomic problem can be broken down into a sequence
of computational steps. The genome material (e.g.,
a DNA fragment), by means of an experimental
sequencing technique (e.g., ChIP-seq), can be translated
first into reads (through primary analysis
methodologies), then into alignments, signals, and finally regions (by
means of secondary analysis methods).</p>
      <p>Within the GeCo Project<sup>1</sup>, described in [CBC+17],
we use a machine-readable representation, which
includes "Region data" and the related "Metadata"
files. Region data consist of quadruples such as
(chr1,1,16,+), which identifies the region contained
in chromosome 1 of the human genome, spanning
coordinates 1 to 16 with respect to a reference genome, and
located on the positive strand of the double helix
structure of DNA. Metadata, instead, contain
information about the genomic experiment which generated
the data.</p>
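      <p>As an illustration of the representation just described, the following minimal Python sketch pairs region quadruples with metadata attribute-value pairs for one sample. The names Region, region_length, and sample are hypothetical and not part of the GeCo software; treating coordinates as a half-open interval is an assumption made only for this example.</p>
      <preformat preformat-type="code">
```python
from typing import NamedTuple


class Region(NamedTuple):
    """One genomic region: chromosome, start, stop, strand."""
    chrom: str
    start: int
    stop: int
    strand: str


def region_length(region: Region) -> int:
    """Length in bases, assuming a half-open [start, stop) interval."""
    return region.stop - region.start


# Region data for one sample, using coordinates quoted in the text and Fig. 1.
regions = [Region("chr1", 1, 16, "+"), Region("chr1", 68, 94, "+")]

# Metadata for the same sample, stored as attribute-value pairs.
metadata = {
    "biosample_type": "cell line",
    "biosample_term_name": "MCF-7",
    "assay": "ChIP-seq",
    "donor.organism.name": "Homo Sapiens",
}

# A sample couples the two parts, so metadata can be used to locate regions.
sample = {"regions": regions, "metadata": metadata}
print(region_length(regions[0]))
```
      </preformat>
      <p>Keeping the two parts side by side means that a search over the metadata can return the matching region data directly.</p>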
      <p>In this paper we propose a system which, after
passing metadata through a data integration pipeline, as a
final step exposes them by means of a user interface
(similar to the one shown at the bottom of Fig. 1),
ready for querying.</p>
      <p>With our system, we aim to encourage the use of
genomic datasets, allowing easier semantically
enriched search and resulting download of processed
data. We have previously proposed GMQL [MCP+18],
a high-level query language for genomics, and
GDM [MKPC16], an integrative model for processed
data formats. Using the system described here in
combination with the query language and execution engine
implemented within the GeCo Project, we aim to
support the specific processes of retrieval, exploration,
and analysis of genomic data.</p>
      <p>The paper is structured as follows. Section 2
introduces metadata usefulness with a motivating example.
Section 3 overviews the overall system which integrates
data, driven by the use of the Genomic Conceptual
Model. Section 4 explains how we allow novel searches
over the database of genomic experiments through a
web interface. Section 5 briefly mentions related works
in the literature. Finally, Section 6 concludes the
paper.</p>
    </sec>
    <sec id="sec-3">
      <title>Motivating Example</title>
      <p>Genomic datasets are typically characterized by
explanatory information that can be consulted on the
interfaces of data sources; sometimes they are available
for download in various semi-structured formats.
Generally, aspects described by metadata can be clustered
in the following areas: clinical information regarding
the physical individual who has donated the biological
sample extracted for sequencing; biospecimen
information about the tissue (or cell culture) of provenance
and the possible pathologies that affect the biological
material; the technologies (e.g., platforms),
methodologies (i.e., pipelines), and processes used to sequence
the DNA, to align the sequences, and to further
produce DNA regions; the formats and data types, which
describe the new shape of data, defining what kind
of information it delivers; details on the organizational
aspects that include the program, project, and case
study under which the experiment is being conducted.</p>
      <p><sup>1</sup> Data-Driven Genomic Computing, http://www.bioinformatics.deib.polimi.it/geco/</p>
      <p>[Figure 2 (recoverable panels). Gene Expression Omnibus: Source name: T47D-MTVL; Organism: Homo Sapiens; Characteristics: gender: female; tissue: breast cancer ductal carcinoma. ENCODE, first sample: Assay: ChIP-seq; Target: MYC; Biosample: Homo sapiens MCF-7; Biosample Type: cell line; Description: Mammary gland, adenocarcinoma; Health status: Breast cancer (adenocarcinoma). ENCODE, second sample: Assay: ChIP-seq; Target: MYC; Biosample: Homo sapiens MCF-10A; Biosample Type: cell line; Description: Mammary gland, non-tumorigenic cell line; Health status: Fibrocystic disease.]</p>
      <p>All these aspects are recorded by data sources in
various ways. Heterogeneity spans from download
protocols and formats to attribute names and values.
To motivate our effort towards an integrated platform,
we introduce an example which simulates the search
for data suitable for a genomics project. For
illustration purposes, we include just biospecimen
information, leaving aside technological and clinical aspects.
Consider a comparison study between a human
non-healthy breast tissue, suffering from carcinoma, and
a healthy sample coming from a similar tissue. A
researcher in the field, due to previous experience, knows
three portals where interesting data for this
analysis can be located. The results obtained after some browsing are
reported in Fig. 2.</p>
      <p>For the diseased data, describing gene expression,
the chosen source is the GDC Data Portal, an important
repository of human cancer mutation data. As can
be seen at the top of Fig. 2, one or more cases (i.e.,
datasets) can be retrieved by composing a query which
locates variation data on "Breast Invasive
Carcinoma" from "Breast" tissue.</p>
      <p>To compare such data with references, the researcher
chooses additional datasets coming from cell lines, i.e.,
cell cultures which have been permanently established
and made immortal. Since cell lines have been considered a
standard for similar investigations in the past, they
are frequently used in place of primary cells to study
biological processes. The scientific community tends
to accept the derived findings more readily.</p>
      <p>Tumor cell line data are found on the GEO web
interface (middle rectangle of Fig. 2) where, by
browsing thousands of samples, the researcher locates one
from the "Homo Sapiens" organism, where the analyzed
cell type is "T47D-MTVL" and the observed disease is
"breast cancer ductal carcinoma". On ENCODE,
instead, the researcher chooses both a tumor cell line
(bottom left of Fig. 2) and a normal cell line
(bottom right), to make a control check. "MCF-7" is a
cell line started from a diseased tissue afflicted with
"Breast cancer (adenocarcinoma)", while "MCF-10A"
is widely considered its non-tumorigenic counterpart.</p>
      <p>Note that considerable external knowledge is
necessary in order to find these connections, which cannot
be obtained on the mentioned portals. Concerning the
disease choice: "breast invasive carcinoma" is the same
as "breast carcinoma" (as observed in the annotation
from EBI's Expression Atlas [JB15]), which allows
comparing GDC's data with the datasets from GEO and
ENCODE, since they describe more specific diseases
(i.e., "breast cancer (adenocarcinoma)" and "breast
cancer ductal carcinoma" are its sub-types,
according to the Disease Ontology [KAF+14]). Concerning
the cell line choice: researchers typically query
specific databases (such as the cell line browser of the
Catalogue Of Somatic Mutations In Cancer<sup>2</sup>) or
dedicated forums to discover tumor/normal matched cell
line pairs. This information is not encoded in a unique
way across sources and is often missing.</p>
    </sec>
    <sec id="sec-4">
      <title>Integration Procedure</title>
      <p>During the design phase we considered four data
sources: Genomic Data Commons (GDC, [JFGS17]),
containing over 310,000 files, across over 32,000 cases,
in 40 projects, covering many aspects of cancer
genomics; the Encyclopedia of DNA Elements
(ENCODE, [rE12]), with almost 420,000 files, allocated in
over 15,000 experiments, part of 6 different projects,
related to the functional DNA sequences which
intervene at the protein/RNA levels and to the
regulatory elements which control gene expression; the
Gene Expression Omnibus (GEO, [BWL+13]), an
international public repository of high-throughput gene
expression (and other) datasets submitted by the
research community, linked to almost 20,000
published manuscripts; and the Roadmap Epigenomics Project
(REP, [KME+15]), containing 1,936 datasets related
to genetic variation in association with human
disease based on epigenomics evidence. Three other data
sources have been used to validate the approach, and
we plan to add many others as future work.</p>
      <p>The conceived metadata integration process is
designed as an incremental procedure. First, we perform</p>
      <p><sup>2</sup> https://cancer.sanger.ac.uk/cell_lines/</p>
      <sec id="sec-4-1">
        <title>Download + Transformation</title>
      </sec>
      <sec id="sec-4-2">
        <title>Cleaning</title>
      </sec>
      <sec id="sec-4-3">
        <title>Mapping</title>
      </sec>
      <sec id="sec-4-4">
        <title>Normalization</title>
      </sec>
      <sec id="sec-4-5">
        <title>Enrichment</title>
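        <p>[Figure: metadata integration pipeline. Metadata from the sources (GDC, ENCODE, GEO, ...) pass through the phases Download + Transformation, Cleaning, Mapping, Normalization, and Enrichment, producing in turn Raw Metadata, Clean Metadata, Mapped Metadata, Normalized Metadata, and Enriched Metadata. Key-value (K, V) pairs are mapped to entities such as Donor, BioSample, and Item, then extended with Synonyms, XRefs, Controlled terms (&lt;term_id&gt;), and Ontological relatives.]</p>
        <p>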
The values mapped into the global schema are then
normalized. Normalization acts to ensure metadata
consistency at the semantic level. This phase involves
linking values to controlled vocabularies or biomedical
ontologies, which are typically curated manually by
experts. Fig. 5 shows a biological sample entity from
the global schema. From the original source only the
information in the blue solid boxes is retrieved. This
is then completed through normalization, which adds
the information in the red dashed boxes. For
example, the disease information "Breast cancer
(adenocarcinoma)" is equipped with a synonym, "Mammary
adenocarcinoma", and with DOID:3458, the corresponding
concept identifier in the Disease Ontology [KAF+14].
Finally, values are enriched by means of external
ontologies. During this phase, values that have
super-concepts or sub-concepts in the biomedical ontologies
are enriched with all concepts in an "is a" relationship
within three steps in the ontology graph (see the
information in the green dotted boxes in Fig. 5). For
example, the value "breast", corresponding to the attribute
Tissue, is enriched by both its super-concept "Female
reproductive gland" and its sub-concept "Mammary
duct", among others. Details of the normalization and
enrichment pipelines are available in [BCCC18].</p>
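        <p>The two phases above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the pipeline of [BCCC18]: the dictionaries IS_A and VOCABULARY and the functions normalize and enrich are hypothetical, and the tiny "is a" graph contains just the labels quoted in the text.</p>
        <preformat preformat-type="code">
```python
from collections import deque

# Toy "is a" edges (child -> parents); labels come from the example in the text.
IS_A = {
    "mammary duct": ["breast"],
    "breast": ["female reproductive gland"],
}

# Hypothetical controlled vocabulary: raw value -> (label, term id, synonyms).
VOCABULARY = {
    "breast cancer (adenocarcinoma)": (
        "Breast cancer (adenocarcinoma)", "DOID:3458", ["Mammary adenocarcinoma"],
    ),
}


def normalize(raw_value):
    """Link a raw metadata value to a controlled term, or None if unknown."""
    return VOCABULARY.get(raw_value.strip().lower())


def _walk(start, edges, max_steps):
    """Breadth-first search: concepts within max_steps edges of start."""
    seen, queue = set(), deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_steps:
            continue
        for nxt in edges.get(node, []):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen


def enrich(concept, max_steps=3):
    """Collect super- and sub-concepts reachable within max_steps hops."""
    children = {}
    for child, parents in IS_A.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)
    return {
        "super": _walk(concept, IS_A, max_steps),
        "sub": _walk(concept, children, max_steps),
    }


print(normalize("Breast cancer (adenocarcinoma)"))
print(enrich("breast"))
```
        </preformat>
        <p>Bounding the walk by a number of steps keeps the enriched set small; the value of three used as the default mirrors the limit stated in the text.</p>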
        <sec id="sec-4-5-1">
          <title>Tissue: breast</title>
          <p>[Figure 5 (recoverable labels). Tissue: breast; BTO:0000149; mammary part of chest; mammary region; "is a" links to Female reproductive gland and Mammary duct.]</p>
        </sec>
        <sec id="sec-4-5-2">
          <title>Type: cell line</title>
          <p>The web platform ensures easy and fast location of
datasets from the considered set of repositories. We
provide the URL endpoint for download from our
system when the dataset is available there (either as it was retrieved
from the original system or transformed into processed
data to make it suitable for tertiary analysis).
Otherwise, we provide the original source URL for download.</p>
          <p>The laborious integration process is designed to make
data querying easier. An example instance of a user
query on our interface is shown in the lower
part of Fig. 1. This query for genomic experiment
data works regardless of how the requested values are
expressed. For example, thanks to the mapping performed
during the integration process, by using the DONOR
column Species, the user can also reach data that were
documented through similar concepts, such as organism,
abbreviations, such as Sp., or words in
other languages, such as the Italian equivalent specie.
Moreover, thanks to the normalization and enrichment
performed during the integration process, a search
for samples with donors of the "Homo Sapiens" species
will return samples which were marked
with this annotation or, alternatively, with synonyms
(e.g., "man", "Human"), abbreviations (e.g., "H.
sapiens"), misspellings (e.g., "Homo sapeins"), or even
sub-concepts (e.g., "Homo sapiens neanderthalensis").
Similarly, concept-based search also holds for other
attributes.</p>
          <p>To support these functionalities, the system rewrites
user queries to perform wider searches, which also
cover synonyms, hyponyms, and other kinds of
similarities.</p>
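          <p>A minimal sketch of such query rewriting is given below, using the Species example from this section. It is an assumption-laden toy rather than the platform's implementation: ATTRIBUTE_ALIASES, VALUE_EXPANSIONS, rewrite_query, and search are hypothetical names, the alias tables are hard-coded instead of being derived from the integration phases, and matching is done at query time over raw attribute names only to keep the example short.</p>
          <preformat preformat-type="code">
```python
# Hypothetical alias tables; a real system would derive them from the
# mapping, normalization, and enrichment phases instead of literals.
ATTRIBUTE_ALIASES = {"species": {"species", "organism", "sp.", "specie"}}
VALUE_EXPANSIONS = {
    "homo sapiens": {
        "homo sapiens", "human", "man", "h. sapiens",
        "homo sapeins",  # misspelling quoted in the text
        "homo sapiens neanderthalensis",  # sub-concept quoted in the text
    },
}


def rewrite_query(attribute, value):
    """Expand one attribute/value pair into sets of accepted spellings."""
    attr_key, value_key = attribute.strip().lower(), value.strip().lower()
    attributes = ATTRIBUTE_ALIASES.get(attr_key, {attr_key})
    values = VALUE_EXPANSIONS.get(value_key, {value_key})
    return attributes, values


def search(samples, attribute, value):
    """Return samples where an accepted attribute holds an accepted value."""
    attributes, values = rewrite_query(attribute, value)
    hits = []
    for sample in samples:
        for key, raw in sample.items():
            if key.lower() in attributes and raw.strip().lower() in values:
                hits.append(sample)
                break
    return hits


samples = [
    {"id": "s1", "organism": "Homo sapiens"},
    {"id": "s2", "Sp.": "H. sapiens"},
    {"id": "s3", "species": "Mus musculus"},
]
print([s["id"] for s in search(samples, "Species", "Homo Sapiens")])
```
          </preformat>
          <p>In this toy setting the query on Species with value "Homo Sapiens" reaches the first two samples even though neither uses that exact attribute name or spelling.</p>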
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Related works</title>
      <p>Many works in the literature use conceptual models in
the genomics (and, more generally, biomedical) field.
However, they employ the expressive power of conceptual
models to explain biological entities and their
interactions [WZR05, RPCV16]. Instead, we propose to use
a conceptual model as the driving principle to achieve
data integration.</p>
      <p>In the state of the art there have been
multiple attempts to offer integrated access to
heterogeneous sources. Some of these are:
BioKleisli [DOTW97] (to provide read access to complex
structured data), BioMart [SHD+15] (for biomedical
databases), NIF [GBM+08] (in the neuroscience field),
and DATS [SGBRS+17] (for scientific datasets in
general).</p>
      <p>Some of the genomics consortia mentioned
earlier have also provided methodologies to organize metadata
(see the BioProject database [BCG+12], the ENCODE Data
Coordination Center [HSC+16], and the Genomic Data
Commons [JFGS17]). However, these frameworks are not
general enough to include
all genomic data sources, regardless of how far
apart the sub-areas on which their data focus are.
DeepBlue [ALBL16], an interesting starting point
in terms of easy-to-use interfaces, only handles
epigenomic data (i.e., the study of epigenetic modifications of
the cell), a small area compared to genomics as a whole.</p>
      <p>DNADigest [KWR+16] is an effort that investigates
the problem of locating genomic data to download for
research purposes. Their work differs from ours since,
while allowing a dynamic and collaborative curation
of metadata, it only provides means to locate raw
data. Instead, we provide processed data ready to be
used for tertiary analysis.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>Following the need to make genomic datasets and their
information collectively searchable, we propose
a framework to manage, integrate, and semantically enrich
the documentation of experimental data. We will
soon deliver an online platform for metadata-driven
genomic data querying, intended for
the genomic research community. This will be
an important resource for: 1. conducting research
activities by directly using our processed data, available
from the data repository we are currently developing
as a major research project (further details are
omitted for anonymity reasons); 2. locating data through
the URL endpoints of the original data sources.</p>
      <p>Acknowledgements</p>
      <p>This research is funded by the ERC Advanced Grant
693174 GeCo (Data-Driven Genomic Computing),
2016-2021.</p>
      <p>[ALBL16] Felipe Albrecht, Markus List, Christoph Bock, and Thomas Lengauer. DeepBlue epigenomic data server: programmatic data retrieval and analysis of epigenome. Nucleic Acids Research, 44(W1):W581-W586, 2016.</p>
      <p>[BCCC18] Anna Bernasconi, Arif Canakoglu, Andrea Colombo, and Stefano Ceri. Ontology-driven metadata enrichment for genomic datasets. In International Conference on Semantic Web Applications and Tools for Life Sciences, volume 2275. CEUR-WS, 2018.</p>
      <p>[BCCM17] Anna Bernasconi, Stefano Ceri, Alessandro Campi, and Marco Masseroli. Conceptual modeling for genomics: Building an integrated repository of open data. In International Conference on Conceptual Modeling, pages 325-339. Springer, 2017.</p>
      <p>[BCG+12] Tanya Barrett, Karen Clark, Robert Gevorgyan, et al. BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Research, 40(D1):57-63, 2012.</p>
      <p>[BWL+13] Tanya Barrett, Stephen E Wilhite, Pierre Ledoux, et al. NCBI GEO: archive for functional genomics data</p>
      <p>[CBC+17]</p>
      <p>[CCK+17]</p>
      <p>[JFGS17] Mark A Jensen, Vincent Ferretti, Robert L Grossman, and Louis M Staudt. The NCI Genomic Data Commons as an engine for precision medicine. Blood, 130(4):453-459, 2017.</p>
      <p>[KAF+14] Warren A Kibbe, Cesar Arze, Victor Felix, et al. Disease Ontology 2015 update: an expanded and updated database of human diseases for linking biomedical knowledge through disease data. Nucleic Acids Research, 43(D1):D1071-D1078, 2014.</p>
      <p>[KME+15] Anshul Kundaje, Wouter Meuleman, Jason Ernst, et al. Integrative analysis of 111 reference human epigenomes. Nature, 518(7539):317-330, 2015.</p>
      <p>[KWR+16] Nadezda V Kovalevskaya, Charlotte Whicher, Timothy D Richardson, et al. DNAdigest and Repositive: connecting the world of genomic data. PLoS Biology, 14(3):e1002418, 2016.</p>
      <p>[LVAL07] Alberto Labarga, Franck Valentin, Mikael Anderson, and Rodrigo Lopez. Web services at the European Bioinformatics Institute. Nucleic Acids Research, 35(suppl 2):W6-W11, 2007.</p>
      <p>[Man16] Teri A Manolio. Implementing genomics and pharmacogenomics in the clinic: The National Human Genome Research Institute's genomic medicine portfolio. Atherosclerosis, 253:225-236, 2016.</p>
      <p>[MCP+18] Marco Masseroli, Arif Canakoglu, Pietro Pinoli, Abdulrahman Kaitoua, et al. Processing of big heterogeneous genomic datasets for tertiary analysis of next generation sequencing data. Bioinformatics, page bty688, 2018.</p>
      <p>[MKPC16] Marco Masseroli, Abdulrahman Kaitoua, Pietro Pinoli, and Stefano Ceri. Modeling and interoperability of heterogeneous genomic big data for integrative processing and querying. Methods, 111:3-11, 2016.</p>
      <p>[rE12] ENCODE Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414):57-74, 2012.</p>
      <p>[RPCV16] Jose F Reyes Roman, Oscar Pastor, Juan Carlos Casamayor, and Francisco Valverde. Applying conceptual modeling to better understand the human genome. In International Conference on Conceptual Modeling, pages 404-412. Springer, 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [SGBRS+17]
          <string-name>
            <given-names>Susanna-Assunta</given-names>
            <surname>Sansone</surname>
          </string-name>
          , Alejandra Gonzalez-Beltran, Philippe Rocca-Serra, et al.
          <article-title>DATS, the data tag suite to enable discoverability of datasets</article-title>
          .
          <source>Scientific Data</source>
          ,
          <volume>4</volume>
          :
          <fpage>170059</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [SHD+15]
          <string-name>
            <given-names>Damian</given-names>
            <surname>Smedley</surname>
          </string-name>
          , Syed Haider, Steffen Durinck, et al.
          <article-title>The BioMart community portal: an innovative alternative to large, centralized data repositories</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>43</volume>
          (
          <issue>W1</issue>
          ):
          <fpage>589</fpage>
          -
          <lpage>598</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [SLF+15]
          <string-name>
            <given-names>Zachary D</given-names>
            <surname>Stephens</surname>
          </string-name>
          , Skylar Y Lee, Faraz Faghri, et al.
          <article-title>Big data: astronomical or genomical?</article-title>
          <source>PLoS Biology</source>
          ,
          <volume>13</volume>
          (
          <issue>7</issue>
          ):e1002195,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [WZR05]
          <string-name>
            <given-names>Liangjiang</given-names>
            <surname>Wang</surname>
          </string-name>
          , Aidong Zhang, and Murali Ramanathan.
          <article-title>BioStar models of clinical and genomic data for biomedical data warehouse design</article-title>
          .
          <source>Int. J. Bioinformatics Res. Appl.</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>63</fpage>
          -
          <lpage>80</lpage>
          , April
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [ZBC+11]
          <source>Database</source>
          ,
          <year>2011</year>
          , 2011.
        </mixed-citation>
      </ref>
  </back>
</article>