<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Linked Functional Annotation For Differentially Expressed Gene (DEG) Demonstrated using Illumina Body Map 2.0</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alokkumar Jha</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yasar Khan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Muntazir Mehdi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aftab Iqbal</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Achille Zappa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ratnesh Sahay</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dietrich Rebholz-Schuhmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Centre for Data Analytics</institution>
          ,
          <addr-line>NUI Galway</addr-line>
          ,
          <country>Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Semantic Web technologies are central to the integration of disparate data resources. They can be used to exploit data from next generation sequencing (NGS) for therapeutic decisions regarding cancer. In this manuscript, we describe how different data resources, which inform on the expression of specific genes in a tissue and on their variants, can be brought together to indicate a risk of tissue-specific cancer from NGS data. This approach can be used to judge patient genomic data against public reference data resources. The TCGA and COSMIC repositories are processed to connect and query information concerning the expression of genes, copy number variants (CNV), and somatic mutations. We annotated sets of differential expression data from the Illumina Body Map 2.0 (HBM) covering 16 different tissue types and identified genes with an RPKM (Reads Per Kilobase of transcript per Million mapped reads) value greater than 0.5 as a measure indicating an associated risk of cancer. Thus, the differentially expressed genes from HBM can be associated with a tissue type and with gene expression in COSMIC and TCGA, leading to potential biomarkers for that tissue-specific cancer. In the case of ovarian cancer, we retrieved the genomic positions (loci) and the associated genes of potential biomarker candidates, and suggest that this approach and platform can serve future studies well. Altogether, the presented linked annotation platform is the first approach to represent the COSMIC data in RDF format and to link the data with the TCGA datasets. The proposed approach enriches mutations by filling in missing links between the COSMIC and TCGA datasets, which in turn helps to map mutations to associated phenotypes.</p>
      </abstract>
      <kwd-group>
        <kwd>Differentially Expressed Genes (DEG)</kwd>
        <kwd>Linked data</kwd>
        <kwd>Clinical genomics</kwd>
        <kwd>Copy Number Variation (CNV)</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Next Generation Sequencing (NGS) technologies open new diagnostic and
therapeutic avenues for cancer research. However, the resulting high-throughput
sequencing data has to be processed in complex data analytics pipelines that include
annotation services. Unfortunately, there is not yet a well-integrated platform
available for both clinical and translational [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] research to fulfil these
annotation and analytical tasks. In addition, the large volume of NGS data poses
another challenge, since the computational infrastructure for the biological
interpretation will have to cope with very large quantities of data originating from
clinical facilities. Last, but not least, the functional annotation of genomics data
for cancer has to take tissue specificity into consideration and thus has to avoid
ambiguity while aggregating clinical outcomes from disparate resources. In this
paper we focus on exploring gene expression patterns across different cancer and
tissue types. Our experiments are based on the semantic integration of gene
expression, CNV and complete mutation data from two disjoint resources, i.e. COSMIC<sup>1</sup>
and TCGA<sup>2</sup>. By doing this we can assist in variant and mutation prioritization
using the 16 different tissue types given by the Illumina Body Map 2.0; the approach is evaluated
in a case study for ovarian cancer.
      </p>
      <p>To link and retrieve gene- and tissue-specific patterns from a repository of
cancer mutations (TCGA) and a database of globally known mutations
and mutation types (COSMIC), we faced the following three challenges:
(i) to transform heterogeneous data repositories and their storage formats into
standard RDF; (ii) to discover associations (i.e., links) by finding specific
patterns (i.e., correlations) for a gene with regard to its CNV, mutation and gene
expression data sets; and (iii) to query, in a scalable way, the large and
frequently updated datasets covering 16 different tissue types and the gene
expression data from different repositories.</p>
      <p>The experiments conducted in this paper are aligned with the transcriptome
study based on the Human Body Map 2.0 (HBM)<sup>3</sup> from Illumina, which covers
the following tissues: adrenal, adipose, brain, breast, colon, heart, kidney, liver,
lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood
cells. The HBM provides gene-specific information across one or more tissue types
and is intended to support the identification of potential biomarkers for targeted
therapy. In this study, our results not only present novel biological outcomes but
also provide a linked annotation framework that assimilates clinical outcomes
from related data repositories.</p>
      <p>The rest of the paper is structured as follows: Section 2 motivates our working
scenario, exploring the HBM use case and the annotation databases; Section 3
presents the methodology and architecture of the proposed functional annotation
framework; Section 4 gives an evaluation of the functional annotation framework;
Section 6 presents the related work on linking the TCGA repository; and Section 7
draws the conclusions from our work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Motivation</title>
      <p>One approach to understanding the onset of a disease, in particular cancer, is to
compare normal and diseased tissue samples and to interpret the changes in
the expression patterns of the genes with regard to the observed disease status.
In our case, HBM serves to identify similarities in gene expression
patterns across different tissue types: HBM discloses
the similarities between human tissues at the molecular and genetic level. Because
cancers overlap in behaviour, progression and mutated genes, we have
annotated the top 100 genes selected by our filtering criteria with COSMIC to
explore previously reported observations from the TCGA database, e.g., somatic mutations,
genomic loci and other mutations linked to these genes, which were retrieved from healthy
tissues.</p>
      <p><sup>1</sup> http://cancer.sanger.ac.uk/cosmic</p>
      <p><sup>2</sup> https://tcga-data.nci.nih.gov/tcga/</p>
      <p><sup>3</sup> https://www.ebi.ac.uk/gxa/experiments/E-MTAB-513</p>
      <p>
        Human Body Map (HBM) 2.0 from Illumina: HBM covers data from
transcriptome studies for 16 tissue types (see above). Samples for these 16
tissue types have been processed and aligned, and finally their expression levels have been
determined [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Sequencing has been performed to provide both paired-end and
single-end libraries (read lengths of 50 bp and 75 bp). The data
processing platform requires a list of differentially expressed genes as input, which is
the outcome of the RNA-seq data analysis pipeline.
      </p>
      <p>
        The gene expression data extracted from the HBM samples yields a very large
list of more than 52,000 genes. For data processing reasons we chose to reduce
the list and therefore defined the cut-off for the RPKM value according to the
method suggested by Sandberg et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. As a result, the data for each tissue
type includes both the coverage and the RPKM value as the corresponding
expression level. In addition, the RNA-seq data set provides further relevant data
such as CNV, fusion genes, structural variation, differentially expressed genes,
novel mutations, splice junctions and transcriptome variations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Identifying the
associations and relations between these datasets, i.e. the logical connections,
enables further research into cancer and supports the biological and
clinical interpretation of the given data.
      </p>
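      <p>The RPKM cut-off described above can be illustrated with a minimal Python sketch. It uses the standard definition RPKM = 10^9 x C / (N x L), where C is the number of reads mapped to a gene, N the total number of mapped reads in the library and L the transcript length in base pairs, and then keeps the genes whose value is strictly greater than 0.5. The gene names, read counts and library size are hypothetical toy values (not HBM data), and the function names are introduced here for illustration only.</p>
      <preformat><![CDATA[
```python
from typing import Dict


def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of transcript per Million mapped reads: 1e9 * C / (N * L)."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and library size must be positive")
    return 1e9 * read_count / (total_mapped_reads * gene_length_bp)


def expressed_genes(rpkm_by_gene: Dict[str, float], cutoff: float = 0.5) -> Dict[str, float]:
    """Keep only the genes whose RPKM is strictly greater than the cut-off."""
    return {gene: value for gene, value in rpkm_by_gene.items() if value > cutoff}


# Hypothetical toy input: gene -> (mapped reads, transcript length in bp).
counts = {"GENE_A": (500, 2000), "GENE_B": (3, 1500), "GENE_C": (40, 4000)}
library_size = 20_000_000  # hypothetical total number of mapped reads

values = {gene: rpkm(c, length, library_size) for gene, (c, length) in counts.items()}
kept = expressed_genes(values)
print(sorted(kept))
```
]]></preformat>
      <p>In this toy input, GENE_C has an RPKM of exactly 0.5 and is therefore excluded, because the filter requires a value greater than the cut-off.</p>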
      <p>Annotation databases (COSMIC &amp; TCGA): The main focus of this work
is the identification of patterns for cancer mutations (given by TCGA) and
globally known mutations and their types (given by COSMIC) for selected
differentially expressed genes across different tissue types. Figure 1 shows the
correspondences, i.e., the associations or connections that have been established between
the TCGA and COSMIC databases for this purpose. For this task our primary
concern has been the associations between the CNV, the known mutations and
the gene expression data.</p>
      <p>[Figure 1 content, recovered from the extracted labels. TCGA tables: GENE_EXPRESSION (CompositeElement REF, Protein Expression); COMPLETE_MUTATION (Hugo_Symbol, Entrez_Gene_Id, Chromosome, Start_Position, UID, Validation Status, ...); CNV (COPYNUMBER) (Chromosome, Start, End, Segment_Mean).</p>
      <p>COSMIC tables: GENE_EXPRESSION (ID_SAMPLE, SAMPLE_NAME, GENE_NAME, REGULATION); COMPLETE_MUTATION (Genename, Samplename, Tumourorigin, TCGA_ID, ID_tumour, Samplesource, ...); CNV (COPYNUMBER) (ID_SAMPLE, ID_tumour, Primarysite, Histology subtype, MUT_TYPE, Chromosome:G_Start..G_Stop).]</p>
      <p>Fig. 1: Links between COSMIC and TCGA repositories</p>
      <p>As part of this work, basic curation steps for data refinement have been
performed: we had to identify instances that link the two databases or a couple of
events within the databases (see Fig. 1). For example, the MUTATION and GENE
EXPRESSION data in COSMIC could be linked via GENE_NAME, whereas the CNV data
carried sample IDs, as expected. We therefore used the sample IDs after a first iteration
with GENE_NAME. In addition, the chr:start-end position and GENE_NAME were used
to link COSMIC and TCGA (see green arrows in Figure 1). The RDFized
version (see Section 3.1) retains this redundancy in order to keep the false discovery rate (FDR) as
low as possible.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Methodology &amp; Architecture</title>
      <p>The annotation architecture is summarized in Figure 2, which shows its three major
components. First, the RDFization component generates Linked Data from
the TCGA and COSMIC databases, leading to several SPARQL endpoints for
public use. Second, the linking component searches for and discovers
correspondences between selected datasets (TCGA variants, COSMIC differential expression,
COSMIC mutation, etc.). The links discovered by this component have an
effect on the efficiency of the source selection, on the query planning, and on
the overall query execution in a decentralized setting. Third, the scalable query
federation component provides a single point of access through which distributed data
sources can be queried in concert.</p>
      <p>[Figure 2: annotation architecture. Labels recovered from the extracted figure: input "Illumina Body Map 2.0, differentially expressed genes (DEG)"; SPARQL query; SAFE query federation engine (source selection, query planning, query execution, access policy model, data summaries, results); linked TCGA and COSMIC repositories (TCGA_variants, COSMIC_diff_expression, COSMIC_Mutation, TCGA_expression); RDFization of COSMIC (GENE_EXPRESSION, COMPLETE_MUTATION, CNV (COPYNUMBER)) and TCGA; linking fields GENE_NAME, Mutation_ID, Sample_ID, Tumor_ID, Chr_start, Chr_end, TCGA_ID, Variant_type, Ref_Protein, Expression.]</p>
      <sec id="sec-3-7">
        <p>
          The scalable query federation is based on the SPARQL query federation
engine called SAFE [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], which was developed for accessing distributed clinical
trial repositories. SAFE has been adapted to integrate data from the different
TCGA and COSMIC SPARQL endpoints efficiently. More specifically,
SAFE makes use of a favourable distribution of data to reduce the number
of sources required for processing federated SPARQL queries (without
compromising recall). This approach is based on the principle that integrated data
sources allow multiple data sources to be queried in a single search, regardless
of whether they are distributed or centralized, whereas traditional methods of
data integration map the data models to a single, unified model. Such
methods tend to resolve syntactic differences between models, but do not
address possible inconsistencies in the concepts defined in those models. Semantic
integration resolves the syntactic heterogeneity present in multiple data models
as well as the semantic heterogeneity among similar concepts across those data
models.
        </p>
        <p>3.1 RDFization</p>
        <p>
COSMIC raw data files are delivered as tab-separated text (TSV) and are
processed with the COSMIC RDFizer tool, which generates the N3 triples for
the SPARQL endpoint together with statistical information about the data. Only
three types of data have been included, i.e., gene expression, gene mutation
and CNV data. Table 1 shows the overall statistics of the RDFization: row 1
gives, for the COSMIC gene expression data, the number of records and
their size, followed by the number of triples generated and
their size. The next two rows give the same figures for the COSMIC
gene mutation and CNV data. A total of 154 million records has been RDFized,
producing approximately 1.2 billion triples. Row 5 gives the statistics for
the RDF version of TCGA-OV (TCGA Ovarian), which forms a subset of the
linked TCGA<sup>4</sup> data. The RDF file for the COSMIC data can be made available
in line with the COSMIC data policy.</p>
        <p>The main integration is based on owl:sameAs constructs, as can be seen in
Listing 1.1, where two COSMIC sample IDs have been identified as being identical
to two TCGA patient barcode IDs. These links are at the core of the
data integration and data analysis tasks.</p>
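        <p>As a rough illustration of the two steps described above, the following Python sketch turns one tab-separated record into N-Triples lines and derives an owl:sameAs link from a sample name that starts with a TCGA patient barcode. It is not the COSMIC RDFizer tool itself: the subject URI pattern (result/ followed by a row number), the helper names and the barcode pattern are assumptions made for illustration. The example record reuses values shown in Figure 5, and the namespaces are those used in the listings of this section.</p>
        <preformat><![CDATA[
```python
import csv
import io
import re

COSMIC_NS = "http://cosmic.sels.insight.org/schema/"
TCGA_NS = "http://tcga.deri.ie/"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Assumed shape of a TCGA patient barcode prefix, e.g. TCGA-13-0920.
TCGA_BARCODE = re.compile(r"^(TCGA-[A-Z0-9]{2}-[A-Z0-9]{4})")


def row_to_triples(row, row_id):
    """Turn one TSV record into N-Triples lines, one per non-empty column."""
    subject = f"<{COSMIC_NS}result/{row_id}>"  # hypothetical subject URI pattern
    triples = []
    for column, value in row.items():
        if value == "":
            continue
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        triples.append(f'{subject} <{COSMIC_NS}{column}> "{escaped}" .')
    return triples


def same_as_link(sample_name):
    """Return an owl:sameAs triple if the sample name starts with a TCGA barcode."""
    match = TCGA_BARCODE.match(sample_name)
    if match is None:
        return None
    barcode = match.group(1)
    return f"<{COSMIC_NS}ID_Sample/{barcode}> <{OWL_SAME_AS}> <{TCGA_NS}{barcode}> ."


# One example record with a header line, in the tab-separated layout described above.
tsv = "Sample_Name\tGene_Name\tRegulation\nTCGA-13-0920-01\tMYH7\tover\n"
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
triples = row_to_triples(rows[0], 1)
link = same_as_link(rows[0]["Sample_Name"])
print("\n".join(triples))
print(link)
```
]]></preformat>
        <p>Sample names that do not start with a TCGA barcode yield no link in this sketch, so only COSMIC samples that originate from TCGA would be connected this way.</p>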
        <p>Listing 1.1: COSMIC and TCGA Linking Example</p>
        <preformat>&lt;Link 1&gt;
  &lt;Source&gt;COSMIC&lt;/Source&gt;
  &lt;Target&gt;TCGA-OV&lt;/Target&gt;
  &lt;link&gt;
    http://cosmic.sels.insight.org/schema/ID_Sample/TCGA-13-0920
    owl:sameAs
    http://tcga.deri.ie/TCGA-13-0920
  &lt;/link&gt;
&lt;/Link 1&gt;
&lt;Link 2&gt;
  &lt;Source&gt;COSMIC&lt;/Source&gt;
  &lt;Target&gt;TCGA-OV&lt;/Target&gt;
  &lt;link&gt;
    http://cosmic.sels.insight.org/schema/ID_Sample/TCGA-24-1850
    owl:sameAs
    http://tcga.deri.ie/TCGA-24-1850
  &lt;/link&gt;
&lt;/Link 2&gt;</preformat>
        <p>3.3 Scalable query federation</p>
        <p>
SAFE was developed for accessing sensitive clinical data held in data cubes at
different locations [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Two main changes have been introduced to SAFE for
efficiently querying the TCGA and COSMIC SPARQL endpoints. First, the RDF query
representation was standardized: in its initial versions, SAFE issued queries
for statistical clinical information stored within distinct named graphs for RDF
data cubes [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The internal query processing (i.e., source selection,
query planning, query execution) therefore had to be adapted to query the regular
RDFized versions of the TCGA and COSMIC repositories. Second, access control
had to be disabled: SAFE imposes data-access restrictions as a feature
(defined as the Access Policy Model [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]) while federating queries over multiple clinical
sites, i.e. it enforces the data restrictions of the different data repositories. Since the
experiments conducted in this paper mainly involve public repositories, this feature
has been disabled.
        </p>
        <p>Listing 1.2 shows a sample SPARQL query that federates across
COSMIC and TCGA data. It asks for the genomic loci of a mutated gene by
chromosome start points and returns the disease metastasis information
along with the mutation type. Answering such a query requires the
integration of COSMIC with TCGA and the merging of results from both
repositories, and thus has to make use of query federation. The results for the first
four triples in the given query (i.e. cosmic-s:ID_Sample, cosmic-s:Gene_Name,
cosmic-s:chrom_start) are fetched from COSMIC and the results for the next
three triples (i.e. tcga:tcga_id, tcga:start) are fetched from TCGA. To
produce the required information, both results are merged on the basis of the last
triple, which integrates COSMIC with TCGA. Sample results for this query can
be seen in Figure 5.</p>
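        <p>The merge step described above can be sketched in a few lines of Python. This is not the SAFE implementation: it is a minimal, in-memory illustration that assumes the bindings from each endpoint have already been fetched as lists of dictionaries and that the owl:sameAs links are available as a mapping from COSMIC sample identifiers to TCGA identifiers. The function name, the "cosmic:" and "tcga:" identifier prefixes and the second COSMIC row are hypothetical; the MYH7 values are taken from Figure 5.</p>
        <preformat><![CDATA[
```python
def federated_join(cosmic_rows, tcga_rows, same_as):
    """Merge per-endpoint bindings through owl:sameAs links (COSMIC sample -> TCGA id)."""
    # Index the TCGA bindings by identifier so each COSMIC row is matched in O(1).
    tcga_by_id = {}
    for row in tcga_rows:
        tcga_by_id.setdefault(row["tcga_id"], []).append(row)

    merged = []
    for row in cosmic_rows:
        tcga_id = same_as.get(row["id_sample"])  # None when no link exists
        for match in tcga_by_id.get(tcga_id, []):
            merged.append({**row, **match})
    return merged


# Bindings as they might come back from the COSMIC endpoint.
cosmic_rows = [
    {"id_sample": "cosmic:TCGA-13-0920", "gene": "MYH7", "chroms_cosmic": 23418303},
    {"id_sample": "cosmic:TCGA-24-1850", "gene": "GENE_X", "chroms_cosmic": 1000},  # hypothetical row
]
# Bindings as they might come back from the TCGA endpoint.
tcga_rows = [
    {"tcga_id": "tcga:TCGA-13-0920", "tcga_chrom_start": 1288070},
]
# owl:sameAs links between the two repositories.
same_as = {
    "cosmic:TCGA-13-0920": "tcga:TCGA-13-0920",
    "cosmic:TCGA-24-1850": "tcga:TCGA-24-1850",
}

result = federated_join(cosmic_rows, tcga_rows, same_as)
print(result)
```
]]></preformat>
        <p>Only the first COSMIC row survives the join here, because the TCGA bindings contain no entry for the second linked identifier; this mirrors an inner join on the owl:sameAs triple.</p>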
        <p>Listing 1.2: Federated SPARQL Query</p>
        <preformat>PREFIX cosmic-s: &lt;http://cosmic.sels.insight.org/schema/&gt;
PREFIX tcga: &lt;http://tcga.deri.ie/&gt;
PREFIX owl: &lt;http://www.w3.org/2002/07/owl#&gt;
SELECT * WHERE {
  ?cosmic_result a cosmic-s:result ;
      cosmic-s:ID_Sample ?id_sample ; cosmic-s:Gene_Name ?gene ;
      cosmic-s:chrom_start ?chroms_cosmic .
  ?tcga_result a tcga:result ;
      tcga:Id ?tcga_id ; tcga:Start ?tcga_chrom_start .
  ?id_sample owl:sameAs ?tcga_id .
}</preformat>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Biological questions and annotation results from HBM</title>
      <p>
        We analysed all genes that have an RPKM &gt; 0.5 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and that are differentially
expressed in all tissue types. Figure 3 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a schematic representation which
satisfies all the conditions mentioned and delivers 99 genes per query. We have
identified potential cancer types based on gene patterns for different tissues, which
further helped to understand the behaviour of the most amplified cancer types.
The overall goal of this study is to understand the relevance of mutations and
genes along with their associated expression levels measured in data sets for
normal tissues, e.g. HBM 2.0, and then to evaluate (i.e., query) them against the
mutations retrieved and linked from somatic and patient-specific data, e.g.
COSMIC and TCGA. A further focus of this work is on linked annotations, where
a single query can retrieve all other possibly relevant annotations.
      </p>
      <p>Initially, we sampled 99 genes that are highly expressed in all 16
tissues, as shown in Figure 3, and retrieved their CNV, mutation and gene expression
annotations from the cBIO Portal (for TCGA) and the CNV annotator (for COSMIC)
to determine the current state of the art and to provide a baseline for comparison with
the proposed linked annotation solution. The results for TCGA (Figure 4) clearly
indicate an elevated distribution of these genes in uterine and ovarian cancer, with
a large number of mutations and CNVs. As an outcome, ovarian cancer was
selected as a good candidate for further investigation due to its elevated
amplification rate and its repeated occurrence in different experiments. We further
retrieved studies that were conducted to understand the somatic
relevance and the loci (genomic positions) of these genes, as well as more detailed
information on mutations. This study highlighted
genes such as ACTC1, B2M, CRP, FABP3, FABP4, FGA, FGB,
GC, MYH7, RPRH2, SLC26A3, TG and TXNIP, which form the most relevant driver
genes transforming healthy human tissues into ovarian cancer. The same 99
genes were then queried against the 99 genes from HBM 2.0 to obtain the results for
the somatic mutations in cancer. This repeat annotation not only provides the
detailed statistics reported in COSMIC but also validates our earlier
experiments. Table 2 clearly indicates that the locations chr1, chr4 and chr14 and the genes
CRP, FGA, FABP3 and MYH7 could be highly relevant for
the development of ovarian cancer.</p>
      <p>[Figure 4 (plot): bar chart of alteration frequency (0% to 80%) by cancer type across TCGA studies, with rows indicating the availability of mutation data and CNV data per study. Legend: Mutation, Deletion, Amplification, Multiple alterations.]</p>
      <p>
        Fig. 4: TCGA query output from cBIO Portal[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
      </p>
      <p>We now query the same 99 genes from before using the linked annotation
mechanism. The result snippet for ovarian cancer is depicted in Figure 5 with
detailed functional annotations together with the TCGA ids, again for ovarian
cancer.</p>
      <preformat>COSMIC annotations:
cs:result cs:Sample_Name c:TCGA-13-0920-01;
    cs:Gene_Name c:MYH7;
    cs:Regulation "over".
    cs:chr_no c:1;
    cs:chrom_start_m c:23418303;
    cs:chrom_stop_m c:23418303;
    cs:chr_no_m c:14;
    cs:mut_type "GAIN";
    cs:Primary_Site c:ovary;
    cs:Primary_Histology c:carcinoma;

TCGA annotations:
ts:result ts:bcr_patient_barcode t:TCGA-13-0920;
    ts:beta_value "0.0419..";
    ts:chromosome t:14;
    ts:chromosome t:1;
    ts:start t:1288070;
    ts:stop t:1293914;
    ts:scaled_estimate "773.555".</preformat>
      <p>Fig. 5: Linked annotations for MYH7 - COSMIC</p>
      <p>
        Figure 5 shows the COSMIC and TCGA annotations, respectively.
MYH7 corresponds to chr1, which is evident from previous annotations and is
replicated in our study along with its TCGA ID: TCGA-13-0920-01. Its
mutation type is primarily GAIN on chr1 and chr14, which
is a dominant mutation across all its regulation states (over-, under- and normally
expressed). Translational researchers may want to repeat and re-validate the study
with PubMed ID 1398522; in addition, the beta value (a measure of methylation)
of 0.041999536 and the scaled estimate (tumour purity) of 773.555 support
this gene from the epigenetic point of view. Furthermore, the multiple genomic locations
will help clinical practitioners to track CNVs in a targeted study, which
ultimately points towards a better prognosis.
      </p>
      <p>
        Kandoth et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] performed a cancer study with 12 cancer types to enable
logical classifications for the large amount of data generated by TCGA and
ICGC. Saleem et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] have covered the TCGA database for a few cancer types
and for a limited amount of patient data.
      </p>
      <p>
        Likewise, a reduced version of the COSMIC database has been RDFized to
explore the mechanism of TP53 [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and further work on CNV describes a linked
infrastructure for annotating CNVs. The federation platform [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] called "TopFed"
was developed to measure the query execution time on the TCGA data set, and
has since been extended to cover the biological outcomes identified from
Medline abstracts [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Our work covers all CNV, mutation and gene expression
data in COSMIC and has been extended with TCGA for the same types of data, thus forming
a proof of concept for an annotation platform that covers comprehensive linked
life science data. It is important to note that the work presented in this paper
is a preliminary approach to transforming COSMIC into the RDF format and
linking it with the TCGA datasets.
      </p>
    </sec>
    <sec id="sec-5">
      <title>7 Conclusion</title>
      <p>In this paper we have presented a linked data infrastructure for functional
annotation which enables querying different types of mutations and genomic
alterations and contributes to molecular and clinical insights into cancer by identifying
the most relevant variants and prioritizing them. This knowledge could be highly
advantageous for targeted therapy and personalized medicine based on gene
expression data. The presented experiments are based on the TCGA, COSMIC and
HBM 2.0 datasets and have been used to identify sets of genes with relevance for
ovarian cancer together with a comprehensive set of mutations. Similar studies have
to be performed for other cancer types. We have covered CNV, gene expression
and mutation data from COSMIC and TCGA (the latter only for ovarian cancer). We
have processed 1.2 billion COSMIC triples and 100 million TCGA triples, which
in turn generated 27 GB of data. In future, this work will be expanded to cover
levels 1, 2 and 3 along with other datasets from COSMIC to provide in-depth
biological insight for each queried gene.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>This publication has emanated from research supported by the research grant
from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Y. W.</given-names>
            <surname>Asmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. M.</given-names>
            <surname>Necela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. R.</given-names>
            <surname>Kalari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Baker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Carr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Getz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hostetter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.
          <article-title>Detection of redundant fusion transcripts as biomarkers or disease-specific therapeutic targets in breast cancer</article-title>
          .
          <source>Cancer research</source>
          ,
          <volume>72</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1921</fpage>
          -
          <lpage>1928</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. European Bioinformatics Institute.
          <source>Illumina Body Map 2.0</source>
          . European Bioinformatics Institute.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Carroll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hayes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Stickler</surname>
          </string-name>
          .
          <article-title>Named graphs, provenance and trust</article-title>
          .
          <source>In WWW</source>
          , pages
          <fpage>613</fpage>
          -
          <lpage>622</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Crowley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zhabotynsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. K.</given-names>
            <surname>Pakatci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Morgan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Calaway</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Aylor</surname>
          </string-name>
          , et al.
          <article-title>Analyses of allele-specific gene expression in highly divergent mouse crosses identifies pervasive allelic imbalance</article-title>
          .
          <source>Nature genetics</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Aksoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Dogrusoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dresdner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. O.</given-names>
            <surname>Sumer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jacobsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Larsson</surname>
          </string-name>
          , et al.
          <article-title>Integrative analysis of complex cancer genomics and clinical profiles using the cBioPortal</article-title>
          .
          <source>Science signaling</source>
          ,
          <volume>6</volume>
          (
          <issue>269</issue>
          ):
          <fpage>pl1</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C.</given-names>
            <surname>Kandoth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>McLellan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vandin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>McMichael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Wyczalkowski</surname>
          </string-name>
          , et al.
          <article-title>Mutational landscape and significance across 12 major cancer types</article-title>
          .
          <source>Nature</source>
          ,
          <volume>502</volume>
          (
          <issue>7471</issue>
          ):
          <fpage>333</fpage>
          –
          <lpage>339</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iqbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mehdi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Sahay</surname>
          </string-name>
          .
          <article-title>SAFE: policy aware SPARQL query federation over RDF data cubes</article-title>
          .
          <source>In Proceedings of the 7th International Workshop on Semantic Web Applications and Tools for Life Sciences</source>
          , Berlin, Germany, December 9-11,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramskold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. B.</given-names>
            <surname>Burge</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Sandberg</surname>
          </string-name>
          .
          <article-title>An abundance of ubiquitously expressed genes revealed by tissue transcriptome sequence data</article-title>
          .
          <source>PLoS Comput Biol</source>
          ,
          <volume>5</volume>
          (
          <issue>12</issue>
          ):
          <fpage>e1000598</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Kamdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iqbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sampath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. F.</given-names>
            <surname>Deus</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          .
          <article-title>Big linked cancer data: Integrating linked TCGA and PubMed</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <volume>27</volume>
          :
          <fpage>34</fpage>
          –
          <lpage>41</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Kamdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Iqbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sampath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. F.</given-names>
            <surname>Deus</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-C.</given-names>
            <surname>Ngonga</surname>
          </string-name>
          .
          <article-title>Fostering serendipity through big linked data</article-title>
          .
          <source>Semantic Web Challenge at ISWC</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Saleem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Padmanabhuni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. F.</given-names>
            <surname>Deus</surname>
          </string-name>
          .
          <article-title>Linked cancer genome atlas database</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Semantic Systems</source>
          , pages
          <fpage>129</fpage>
          –
          <lpage>134</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>J.</given-names>
            <surname>Xuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Shi</surname>
          </string-name>
          .
          <article-title>Next-generation sequencing in the clinic: promises and challenges</article-title>
          .
          <source>Cancer letters</source>
          ,
          <volume>340</volume>
          (
          <issue>2</issue>
          ):
          <fpage>284</fpage>
          –
          <lpage>295</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>A.</given-names>
            <surname>Zappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Splendiani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Romano</surname>
          </string-name>
          .
          <article-title>Towards linked open gene mutations data</article-title>
          .
          <source>BMC bioinformatics</source>
          ,
          <volume>13</volume>
          (
          <issue>Suppl 4</issue>
          ):
          <fpage>S7</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>